Preface


This book is intended both as a reference book for professional econometricians
and as a graduate textbook. If it is used as a textbook, the material
contained in the book can be taught in a year-long course, as I have done at
Stanford for many years. The prerequisites for such a course should be one
year of calculus, one quarter or semester of matrix analysis, one year of
intermediate statistical inference (see list of textbooks in note 1 of Chapter
3), and, preferably, knowledge of introductory or intermediate econometrics
(say, at the level of Johnston, 1972). This last requirement is not necessary, but
I have found in the past that a majority of economics students who take a
graduate course in advanced econometrics do have knowledge of introductory
or intermediate econometrics.

The main features of the book are the following: a thorough treatment of 
classical least squares theory (Chapter 1) and generalized least squares theory 
(Chapter 6); a rigorous discussion of large sample theory (Chapters 3 and 4); a 
detailed analysis of qualitative response models (Chapter 9), censored or 
truncated regression models (Chapter 10), and Markov chain and duration 
models (Chapter 11); and a discussion of nonlinear simultaneous equations 
models (Chapter 8). 

The book presents only the fundamentals of time series analysis (Chapter 5 
and a part of Chapter 6) because there are several excellent textbooks on the 
subject (see the references cited at the beginning of Chapter 5). In contrast, the 
models I discuss in the last three chapters have been used extensively in recent 
econometric applications but have not received in any textbook as complete a 
treatment as I give them here. Some instructors may wish to supplement my 
book with a textbook in time series analysis. 

My discussion of linear simultaneous equations models (Chapter 7) is also 
brief. Those who wish to study the subject in greater detail should consult the 
references given in Chapter 7. I chose to devote more space to the discussion of 
nonlinear simultaneous equations models, which are still at an early stage of
development and consequently have received only scant coverage in most
textbooks.

In many parts of the book, and in all of Chapters 3 and 4, I have used the
theorem-proof format and have attempted to develop all the mathematical 
results rigorously. However, it has not been my aim to present theorems in full 
mathematical generality. Because I intended this as a textbook rather than as a
monograph, I chose assumptions that are relatively easy to understand and 
that lead to simple proofs, even in those instances where they could be relaxed. 
This will enable readers to understand the basic structure of each theorem and 
to generalize it for themselves depending on their needs and abilities. Many 
simple applications of theorems are given either in the form of examples in the 
text or in the form of exercises at the end of each chapter to bring out the 
essential points of each theorem. 

Although this is a textbook in econometrics methodology, I have included 
discussions of numerous empirical papers to illustrate the practical use of 
theoretical results. This is especially conspicuous in the last three chapters of 
the book. 


Too many people have contributed to the making of this book through the 
many revisions it has undergone to mention all their names. I am especially 
grateful to Trevor Breusch, Hidehiko Ichimura, Tom MaCurdy, Jim Powell, 
and Gene Savin for giving me valuable comments on the entire manuscript. I 
am also indebted to Carl Christ, Art Goldberger, Cheng Hsiao, Roger 
Koenker, Tony Lancaster, Chuck Manski, and Hal White for their valuable 
comments on parts of the manuscript. I am grateful to Colin Cameron, Tom 
Downes, Harry Paarsch, Aaron Han, and Choon Moon for proofreading and 
to the first three for correcting my English. In addition, Tom Downes and 
Choon Moon helped me with the preparation of the index. Dzung Pham has 
typed most of the manuscript through several revisions; her unfailing patience 
and good nature despite many hours of overtime work are much appreciated. 
David Criswell, Cathy Shimizu, and Bach-Hong Tran have also helped with 
the typing. The financial support of the National Science Foundation for the 
research that produced many of the results presented in the book is gratefully 
acknowledged. Finally, I am indebted to the editors of the Journal of Economic
Literature for permission to include in Chapter 9 parts of my article
entitled "Qualitative Response Models: A Survey" (Journal of Economic
Literature 19:1483-1536, 1981) and to North-Holland Publishing Company
for permission to use in Chapter 10 the revised version of my article entitled
"Tobit Models: A Survey" (Journal of Econometrics 24:3-61, 1984).
1 Classical Least Squares Theory


In this chapter we shall consider the basic results of statistical inference in the
classical linear regression model, that is, the model in which the regressors are
independent of the error term and the error term is serially uncorrelated and has a
constant variance. This model is the starting point of the study; the models to
be examined in later chapters are modifications of this one.


1.1 Linear Regression Model 


In this section let us look at the reasons for studying the linear regression 
model and the method of specifying it. We shall start by defining Model 1, to 
be considered throughout the chapter. 


1.1.1 Introduction 


Consider a sequence of $K$ random variables $(y_t, x_{2t}, x_{3t}, \ldots, x_{Kt})$,
$t = 1, 2, \ldots, T$. Define a $T$-vector $y = (y_1, y_2, \ldots, y_T)'$, a $(K-1)$-vector
$x_t^* = (x_{2t}, x_{3t}, \ldots, x_{Kt})'$, and a $[(K-1) \times T]$-vector
$x^* = (x_1^{*\prime}, x_2^{*\prime}, \ldots, x_T^{*\prime})'$. Suppose for the sake of exposition that the joint density
of the variables is given by $f(y, x^*, \theta)$, where $\theta$ is a vector of unknown parameters.
We are concerned with inference about the parameter vector $\theta$ on the
basis of the observed vectors $y$ and $x^*$.

In econometrics we are often interested in the conditional distribution of 
one set of random variables given another set of random variables; for exam- 
ple, the conditional distribution of consumption given income and the condi- 
tional distribution of quantities demanded given prices. Suppose we want to 
know the conditional distribution of y given x*. We can write the joint density 
as the product of the conditional density and the marginal density as in 


$$f(y, x^*, \theta) = f(y \mid x^*, \theta_1)\, f(x^*, \theta_2). \tag{1.1.1}$$


Regression analysis can be defined as statistical inference on $\theta_1$. For this
purpose we can ignore $f(x^*, \theta_2)$, provided there is no relationship between $\theta_1$
and $\theta_2$. The vector $y$ is called the vector of dependent or endogenous variables,
and the vector $x^*$ is called the vector of independent or exogenous variables.

In regression analysis we usually want to estimate only the first and second
moments of the conditional distribution, rather than the whole parameter
vector $\theta_1$. (In certain special cases the first two moments characterize $\theta_1$
completely.) Thus we can define regression analysis as statistical inference on
the conditional mean $E(y \mid x^*)$ and the conditional variance-covariance matrix
$V(y \mid x^*)$. Generally, these moments are nonlinear functions of $x^*$. However, in
the present chapter we shall consider the special case in which $E(y_t \mid x^*)$ is equal
to $E(y_t \mid x_t^*)$ and is a linear function of $x_t^*$, and $V(y \mid x^*)$ is a constant times an
identity matrix. Such a model is called the classical (or standard) linear
regression model or the homoscedastic (meaning constant variance) linear
regression model. Because this is the model to be studied in Chapter 1, let us
call it simply Model 1.


1.1.2 Model 1 
By writing $x_t = (1, x_t^{*\prime})'$, we can define Model 1 as follows. Assume

$$y_t = x_t'\beta + u_t, \qquad t = 1, 2, \ldots, T, \tag{1.1.2}$$

where $y_t$ is a scalar observable random variable, $\beta$ is a $K$-vector of unknown
parameters, $x_t$ is a $K$-vector of known constants such that $\sum_{t=1}^{T} x_t x_t'$ is nonsingular,
and $u_t$ is a scalar, unobservable, random variable (called the error term
or the disturbance) such that $Eu_t = 0$, $Vu_t = \sigma^2$ (another unknown parameter)
for all $t$, and $Eu_t u_s = 0$ for $t \neq s$.

Note that we have assumed $x^*$ to be a vector of known constants. This is
essentially equivalent to stating that we are concerned only with estimating
the conditional distribution of $y$ given $x^*$. The most important assumption of
Model 1 is the linearity of $E(y_t \mid x_t^*)$; we therefore shall devote the next subsection
to a discussion of the implications of that assumption. We have also made
the assumption of homoscedasticity ($Vu_t = \sigma^2$ for all $t$) and the assumption of
no serial correlation ($Eu_t u_s = 0$ for $t \neq s$), not because we believe that they are
satisfied in most applications, but because they make a convenient starting
point. These assumptions will be removed in later chapters.

We shall sometimes impose additional assumptions on Model 1 to obtain
certain specific results. Notably, we shall occasionally make the assumption of
serial independence of $\{u_t\}$ or the assumption that $u_t$ is normally distributed.
In general, independence is a stronger assumption than no correlation, although
under normality the two concepts are equivalent. The additional
assumptions will be stated whenever they are introduced into Model 1.


1.1.3 Implications of Linearity 


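Suppose random variables $y_t$ and $x_t^*$ have finite second moments and their
variance-covariance matrix is denoted by

$$V \begin{bmatrix} y_t \\ x_t^* \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & \sigma_{12}' \\ \sigma_{12} & \Sigma_{22} \end{bmatrix}.$$

Then we can always write

$$y_t = \beta_0 + x_t^{*\prime}\beta_1 + v_t, \tag{1.1.3}$$

where $\beta_1 = \Sigma_{22}^{-1}\sigma_{12}$, $\beta_0 = Ey_t - \sigma_{12}'\Sigma_{22}^{-1}Ex_t^*$, $Ev_t = 0$, $Vv_t = \sigma_1^2 - \sigma_{12}'\Sigma_{22}^{-1}\sigma_{12}$,
and $Ex_t^* v_t = 0$. It is important to realize that Model 1 implies certain assumptions
that (1.1.3) does not: (1.1.3) does not generally imply linearity of
$E(y_t \mid x_t^*)$ because $E(v_t \mid x_t^*)$ may not generally be zero.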

We call $\beta_0 + x_t^{*\prime}\beta_1$ in (1.1.3) the best linear predictor of $y_t$ given $x_t^*$ because
$\beta_0$ and $\beta_1$ can be shown to be the values of $b_0$ and $b_1$ that minimize
$E(y_t - b_0 - x_t^{*\prime}b_1)^2$. In contrast, the conditional mean $E(y_t \mid x_t^*)$ is called the
best predictor of $y_t$ given $x_t^*$ because $E[y_t - E(y_t \mid x_t^*)]^2 \leq E[y_t - g(x_t^*)]^2$ for
any function $g$.

The reader might ask why we work with eq. (1.1.2) rather than with (1.1.3).
The answer is that (1.1.3) is so general that it does not allow us to obtain
interesting results. For example, whereas the natural estimators of $\beta_0$ and $\beta_1$
can be defined by replacing the moments of $y_t$ and $x_t^*$ that characterize $\beta_0$ and
$\beta_1$ with their corresponding sample moments (they actually coincide with the
least squares estimator), the mean of the estimator cannot be evaluated without
specifying more about the relationship between $x_t^*$ and $v_t$.

How restrictive is the linearity of $E(y_t \mid x_t^*)$? It holds if $y_t$ and $x_t^*$ are jointly
normal or if $y_t$ and $x_t^*$ are both scalar dichotomous (Bernoulli) variables.¹ But
the linearity may not hold for many interesting distributions. Nevertheless,
the linear assumption is not as restrictive as it may appear at first glance
because $x_t^*$ can be variables obtained by transforming the original independent
variables in various ways. For example, if the conditional mean of $y_t$, the
supply of a good, is a quadratic function of the price, $p_t$, we can put
$x_t^* = (p_t, p_t^2)'$, thereby making $E(y_t \mid x_t^*)$ linear.
1.1.4 Matrix Notation


To facilitate the subsequent analysis, we shall write (1.1.2) in matrix notation
as

$$y = X\beta + u, \tag{1.1.4}$$

where $y = (y_1, y_2, \ldots, y_T)'$, $u = (u_1, u_2, \ldots, u_T)'$, and
$X = (x_1, x_2, \ldots, x_T)'$. In other words, $X$ is the $T \times K$ matrix, the $t$th row of
which is $x_t'$. The elements of the matrix $X$ are described as

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1K} \\ x_{21} & x_{22} & \cdots & x_{2K} \\ \vdots & \vdots & & \vdots \\ x_{T1} & x_{T2} & \cdots & x_{TK} \end{bmatrix}.$$

If we want to focus on the columns of $X$, we can write
$X = [x_{(1)}, x_{(2)}, \ldots, x_{(K)}]$, where each $x_{(i)}$ is a $T$-vector. If there is no danger of
confusing $x_{(i)}$ with $x_t$, we can drop the parentheses and write simply $x_i$. In
matrix notation the assumptions on $X$ and $u$ can be stated as follows: $X'X$ is
nonsingular, which is equivalent to stating $\mathrm{rank}(X) = K$ if $T \geq K$; $Eu = 0$;
and $Euu' = \sigma^2 I_T$, where $I_T$ is the $T \times T$ identity matrix. (Whenever the size of
an identity matrix can be inferred from the context, we write it simply as $I$.)

In the remainder of this chapter we shall no longer use the partition
$\beta' = (\beta_0, \beta_1')$; instead, the elements of $\beta$ will be written as
$\beta = (\beta_1, \beta_2, \ldots, \beta_K)'$. Similarly, we shall not necessarily assume that $x_{(1)}$ is the
vector of ones, although in practice this is usually the case. Most of our results
will be obtained simply on the assumption that $X$ is a matrix of constants,
without specifying specific values.


1.2 Theory of Least Squares 


In this section we shall define the least squares estimator of the parameter $\beta$ in
Model 1 and shall show that it is the best linear unbiased estimator. We shall
also discuss estimation of the error variance $\sigma^2$.


1.2.1 Definition of Least Squares Estimators of $\beta$ and $\sigma^2$


The least squares (LS) estimator $\hat\beta$ of the regression parameter $\beta$ in Model 1 is
defined to be the value of $\beta$ that minimizes the sum of squared residuals²

$$S(\beta) = (y - X\beta)'(y - X\beta) = y'y - 2y'X\beta + \beta'X'X\beta. \tag{1.2.1}$$

Putting the derivatives of $S(\beta)$ with respect to $\beta$ equal to 0, we have

$$\frac{\partial S}{\partial \beta} = -2X'y + 2X'X\beta = 0, \tag{1.2.2}$$

where $\partial S/\partial\beta$ denotes the $K$-vector the $i$th element of which is $\partial S/\partial\beta_i$, $\beta_i$ being
the $i$th element of $\beta$. Solving (1.2.2) for $\beta$ gives

$$\hat\beta = (X'X)^{-1}X'y. \tag{1.2.3}$$

Clearly, $S(\beta)$ attains the global minimum at $\hat\beta$.

Let us consider the special case $K = 2$ and $x_t' = (1, x_{2t})$ and represent each
of the $T$ observations $(y_t, x_{2t})$ by a point on the plane. Then, geometrically,
the least squares estimates are the intercept and the slope of a line drawn in
such a way that the sum of squares of the deviations between the points and the
line is minimized in the direction of the y-axis. Different estimates result if the
sum of squares of deviations is minimized in any other direction.
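
The special case $K = 2$ can be sketched numerically. The following is a minimal illustration, assuming made-up data and a hypothetical helper name `ls_simple` (neither comes from the text); it solves the 2 x 2 normal equations $X'X\beta = X'y$ directly and then checks that the residuals are orthogonal to both columns of $X$.

```python
# Least squares for the special case K = 2 with x_t' = (1, x_2t).
# The data are invented purely for illustration.

def ls_simple(x, y):
    """Solve the 2x2 normal equations X'X b = X'y for (intercept, slope)."""
    T = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = T * sxx - sx * sx              # determinant of X'X; nonzero when rank(X) = 2
    intercept = (sxx * sy - sx * sxy) / det
    slope = (T * sxy - sx * sy) / det
    return intercept, slope

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b1, b2 = ls_simple(x, y)
residuals = [yi - b1 - b2 * xi for xi, yi in zip(x, y)]

# Deviations are measured in the direction of the y-axis, so the residual
# vector is orthogonal to the column of ones and to the column x.
orth_const = sum(residuals)
orth_x = sum(r * xi for r, xi in zip(residuals, x))
```

Both `orth_const` and `orth_x` should be zero up to rounding error, which is the first-order condition for minimizing the sum of squared residuals written out for two regressors.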

Given the least squares estimator $\hat\beta$, we define

$$\hat u = y - X\hat\beta \tag{1.2.4}$$

and call it the vector of the least squares residuals. Using $\hat u$, we can estimate $\sigma^2$
by

$$\hat\sigma^2 = T^{-1}\hat u'\hat u, \tag{1.2.5}$$

called the least squares estimator of $\sigma^2$, although the use of the term least
squares here is not as compelling as in the estimation of the regression
parameters.

Using (1.2.4), we can write

$$y = X\hat\beta + \hat u = Py + My, \tag{1.2.6}$$

where $P = X(X'X)^{-1}X'$ and $M = I - P$. Because $\hat u$ is orthogonal to $X$ (that is,
$\hat u'X = 0$), least squares estimation can be regarded as decomposing $y$ into two
orthogonal components: a component that can be written as a linear combination
of the column vectors of $X$ and a component that is orthogonal to $X$.
Alternatively, we can call $Py$ the projection of $y$ onto the space spanned by the
column vectors of $X$ and $My$ the projection of $y$ onto the space orthogonal to
$X$. Theorem 14 of Appendix 1 gives the properties of a projection matrix such
as $P$ or $M$. In the special case where both $y$ and $X$ are two-dimensional vectors
(that is, $K = 1$ and $T = 2$), the decomposition (1.2.6) can be illustrated as in
Figure 1.1, where the vertical and horizontal axes represent the first and
second observations, respectively, and the arrows represent vectors.

Figure 1.1 Orthogonal decomposition of y

From (1.2.6) we obtain

$$y'y = y'Py + y'My. \tag{1.2.7}$$

The goodness of fit of the regression of $y$ on $X$ can be measured by the ratio
$y'Py/y'y$, sometimes called $R^2$. However, it is more common to define $R^2$ as
the square of the sample correlation between $y$ and $Py$:

$$R^2 = \frac{(y'LPy)^2}{y'Ly \cdot y'PLPy}, \tag{1.2.8}$$

where $L = I_T - T^{-1}ll'$ and $l$ denotes the $T$-vector of ones. If we assume one of
the columns of $X$ is $l$ (which is usually the case), we have $LP = PL$. Then we
can rewrite (1.2.8) as

$$R^2 = \frac{y'LPLy}{y'Ly} = 1 - \frac{y'My}{y'Ly}. \tag{1.2.9}$$

Thus $R^2$ can be interpreted as a measure of the goodness of fit of the regression
of the deviations of $y$ from its mean on the deviations of the columns of $X$ from
their means. (Section 2.1.4 gives a modification of $R^2$ suggested by Theil,
1961.)
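
The agreement between the two expressions for $R^2$ can be checked numerically. This is a small sketch under stated assumptions: the data are invented, $X$ is taken to be $[l, x]$ so that one column is the vector of ones, and the variable names are illustrative only.

```python
# Check that the squared correlation form (1.2.8) and the form
# 1 - y'My / y'Ly in (1.2.9) agree when X contains a column of ones.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
T = len(y)

xbar, ybar = sum(x) / T, sum(y) / T
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
intercept = ybar - slope * xbar

fitted = [intercept + slope * xi for xi in x]     # Py
resid = [yi - fi for yi, fi in zip(y, fitted)]    # My

tss = sum((yi - ybar) ** 2 for yi in y)           # y'Ly
rss = sum(r * r for r in resid)                   # y'My
r2_from_fit = 1.0 - rss / tss                     # right-hand side of (1.2.9)

# Squared sample correlation between y and Py, as in (1.2.8).
fbar = sum(fitted) / T
cross = sum((yi - ybar) * (fi - fbar) for yi, fi in zip(y, fitted))  # y'LPy
fss = sum((fi - fbar) ** 2 for fi in fitted)                          # y'PLPy
r2_from_corr = cross ** 2 / (tss * fss)
```

The two numbers coincide up to rounding only because the regression includes a constant term; without it, $LP = PL$ need not hold and the two definitions can differ.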


1.2.2 Least Squares Estimator of a Subset of $\beta$


It is sometimes useful to have an explicit formula for a subset of the least
squares estimates $\hat\beta$. Suppose we partition $\hat\beta' = (\hat\beta_1', \hat\beta_2')$, where $\hat\beta_1$ is a $K_1$-vector
and $\hat\beta_2$ is a $K_2$-vector such that $K_1 + K_2 = K$. Partition $X$ conformably as
$X = (X_1, X_2)$. Then we can write $X'X\hat\beta = X'y$ as

$$X_1'X_1\hat\beta_1 + X_1'X_2\hat\beta_2 = X_1'y \tag{1.2.10}$$

and

$$X_2'X_1\hat\beta_1 + X_2'X_2\hat\beta_2 = X_2'y. \tag{1.2.11}$$

Solving (1.2.11) for $\hat\beta_2$ and inserting it into (1.2.10), we obtain

$$\hat\beta_1 = (X_1'M_2X_1)^{-1}X_1'M_2y, \tag{1.2.12}$$

where $M_2 = I - X_2(X_2'X_2)^{-1}X_2'$. Similarly,

$$\hat\beta_2 = (X_2'M_1X_2)^{-1}X_2'M_1y, \tag{1.2.13}$$

where $M_1 = I - X_1(X_1'X_1)^{-1}X_1'$.


In Model 1 we assume that $X$ is of full rank, an assumption that implies that
the matrices to be inverted in (1.2.12) and (1.2.13) are both nonsingular.
Suppose for a moment that $X_1$ is of full rank but that $X_2$ is not. In this case $\beta_2$
cannot be estimated, but $\beta_1$ still can be estimated by modifying (1.2.12) as

$$\hat\beta_1 = (X_1'M_2^*X_1)^{-1}X_1'M_2^*y, \tag{1.2.14}$$

where $M_2^* = I - X_2^*(X_2^{*\prime}X_2^*)^{-1}X_2^{*\prime}$, where the columns of $X_2^*$ consist of a maximal
number of linearly independent columns of $X_2$, provided that $X_1'M_2^*X_1$ is
nonsingular. (For the more general problem of estimating a linear combination
of the elements of $\beta$, see Section 2.2.3.)
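
Formula (1.2.12) can be illustrated in the simplest partition. The sketch below assumes invented data, takes $X_1$ to be a single regressor $x$ and $X_2$ to be the vector of ones, so that $M_2$ simply subtracts the sample mean; the variable names are illustrative and not from the text.

```python
# Partitioned least squares with X1 = x (K1 = 1) and X2 = vector of ones (K2 = 1).
x = [0.5, 1.0, 2.5, 4.0, 7.0, 8.0]
y = [1.2, 1.9, 4.1, 6.3, 9.8, 12.0]
T = len(x)

# Full regression: solve the 2x2 normal equations for (intercept, slope).
sx, sy = sum(x), sum(y)
sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))
det = T * sxx - sx * sx
slope_full = (T * sxy - sx * sy) / det
intercept_full = (sxx * sy - sx * sxy) / det

# Partitioned formula (1.2.12): M2 v = v - mean(v) when X2 is the vector of ones.
xbar = sx / T
m2x = [xi - xbar for xi in x]                       # M2 X1
x1_m2_x1 = sum(a * xi for a, xi in zip(m2x, x))     # X1' M2 X1
x1_m2_y = sum(a * yi for a, yi in zip(m2x, y))      # X1' M2 y
slope_partitioned = x1_m2_y / x1_m2_x1
```

The slope from the full system and the slope from (1.2.12) agree up to rounding, which is the familiar statement that regressing on deviations from means reproduces the slope of a regression with an intercept.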


1.2.3 The Mean and Variance of $\hat\beta$ and $\hat\sigma^2$

Inserting (1.1.4) into (1.2.3), we have

$$\hat\beta = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'u. \tag{1.2.15}$$

Clearly, $E\hat\beta = \beta$ by the assumptions of Model 1. Using the second line of
(1.2.15), we can derive the variance-covariance matrix of $\hat\beta$:

$$\begin{aligned} V\hat\beta &\equiv E(\hat\beta - \beta)(\hat\beta - \beta)' \\ &= E(X'X)^{-1}X'uu'X(X'X)^{-1} \\ &= \sigma^2(X'X)^{-1}. \end{aligned} \tag{1.2.16}$$

From (1.2.3) and (1.2.4), we have $\hat u = Mu$, where $M = I - X(X'X)^{-1}X'$.
Using the properties of the projection matrix given in Theorem 14 of Appendix 1,
we obtain

$$\begin{aligned} E\hat\sigma^2 &= T^{-1}Eu'Mu \\ &= T^{-1}E\,\mathrm{tr}\,Muu' \quad \text{by Theorem 6 of Appendix 1} \\ &= T^{-1}\sigma^2\,\mathrm{tr}\,M \\ &= T^{-1}(T - K)\sigma^2 \quad \text{by Theorems 7 and 14 of Appendix 1,} \end{aligned} \tag{1.2.17}$$

which shows that $\hat\sigma^2$ is a biased estimator of $\sigma^2$. We define the unbiased
estimator of $\sigma^2$ by

$$\tilde\sigma^2 = (T - K)^{-1}\hat u'\hat u. \tag{1.2.18}$$

We shall obtain the variance of $\hat\sigma^2$ later, in Section 1.3, under the additional
assumption that $u$ is normal.

The quantity $V\hat\beta$ can be estimated by substituting either $\hat\sigma^2$ or $\tilde\sigma^2$ (defined
above) for the $\sigma^2$ that appears in the right-hand side of (1.2.16).
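
The trace step in (1.2.17) and the relation between the two variance estimators can be sketched numerically. The example assumes invented data and $X = [l, x]$ with $K = 2$; it uses the standard expression for the diagonal of $P$ in a regression with an intercept, $p_{tt} = 1/T + (x_t - \bar x)^2 / \sum_s (x_s - \bar x)^2$, which is not stated in the text.

```python
# tr M = T - K for X = [1, x], and the biased versus unbiased variance estimators.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
T, K = len(x), 2

xbar, ybar = sum(x) / T, sum(y) / T
sxx = sum((xi - xbar) ** 2 for xi in x)

# Diagonal elements of P = X(X'X)^{-1}X' for a regression with an intercept.
p_diag = [1.0 / T + (xi - xbar) ** 2 / sxx for xi in x]
trace_M = T - sum(p_diag)                  # tr M = tr I - tr P

slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
intercept = ybar - slope * xbar
rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))  # u_hat'u_hat

sigma2_hat = rss / T            # (1.2.5), biased downward by the factor (T - K)/T
sigma2_tilde = rss / (T - K)    # (1.2.18), unbiased under Model 1
```

Here `trace_M` equals $T - K = 3$ up to rounding, and `sigma2_hat` is smaller than `sigma2_tilde` by exactly the factor $(T - K)/T$.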


1.2.4 Definition of Best 


Before we prove that the least squares estimator is best linear unbiased, we 
must define the term best. First we shall define it for scalar estimators, then for 
vector estimators. 


DEFINITION 1.2.1. Let $\hat\theta$ and $\theta^*$ be scalar estimators of a scalar parameter $\theta$.
The estimator $\hat\theta$ is said to be at least as good as (or at least as efficient as) the
estimator $\theta^*$ if $E(\hat\theta - \theta)^2 \leq E(\theta^* - \theta)^2$ for all parameter values. The estimator
$\hat\theta$ is said to be better (or more efficient) than the estimator $\theta^*$ if $\hat\theta$ is at least as
good as $\theta^*$ and $E(\hat\theta - \theta)^2 < E(\theta^* - \theta)^2$ for at least one parameter value. An
estimator is said to be best (or efficient) in a class if it is better than any other
estimator in the class.


The mean squared error is a reasonable criterion in many situations and is
mathematically convenient. So, following the convention of the statistical
literature, we have defined "better" to mean "having a smaller mean squared
error." However, there may be situations in which a researcher wishes to use
other criteria, such as the mean absolute error.


DEFINITION 1.2.2. Let $\hat\theta$ and $\theta^*$ be estimators of a vector parameter $\theta$. Let $A$
and $B$ be their respective mean squared error matrices; that is,
$A = E(\hat\theta - \theta)(\hat\theta - \theta)'$ and $B = E(\theta^* - \theta)(\theta^* - \theta)'$. Then we say $\hat\theta$ is better (or
more efficient) than $\theta^*$ if

$$c'(B - A)c \geq 0 \quad \text{for every vector } c \text{ and every parameter value} \tag{1.2.19}$$

and

$$c'(B - A)c > 0 \quad \text{for at least one value of } c \text{ and at least one value of the parameter.} \tag{1.2.20}$$

This definition of better clearly coincides with Definition 1.2.1 if $\theta$ is a scalar.
In view of Definition 1.2.1, equivalent forms of statements (1.2.19) and
(1.2.20) are statements (1.2.21) and (1.2.22):

$$c'\hat\theta \text{ is at least as good as } c'\theta^* \text{ for every vector } c \tag{1.2.21}$$

and

$$c'\hat\theta \text{ is better than } c'\theta^* \text{ for at least one value of } c. \tag{1.2.22}$$

Using Theorem 4 of Appendix 1, they also can be written as

$$B \geq A \quad \text{for every parameter value} \tag{1.2.23}$$

and

$$B \neq A \quad \text{for at least one parameter value.} \tag{1.2.24}$$

(Note that $B \geq A$ means $B - A$ is nonnegative definite and $B > A$ means
$B - A$ is positive definite.)

We shall now prove the equivalence of (1.2.20) and (1.2.24). Because the
phrase "for at least one parameter value" is common to both statements, we
shall ignore it in the following proof. First, suppose (1.2.24) is not true. Then
$B = A$. Therefore $c'(B - A)c = 0$ for every $c$, a condition that implies that
(1.2.20) is not true. Second, suppose (1.2.20) is not true. Then $c'(B - A)c = 0$
for every $c$ and every diagonal element of $B - A$ must be 0 (choose $c$ to be the
zero vector, except for 1 in the $i$th position). Also, the $i,j$th element of $B - A$ is
0 (choose $c$ to be the zero vector, except for 1 in the $i$th and $j$th positions, and
note that $B - A$ is symmetric). Thus (1.2.24) is not true. This completes the
proof.

Note that replacing $B \neq A$ in (1.2.24) with $B > A$ (or making the corresponding
change in (1.2.20) or (1.2.22)) is unwise because we could not then
rank the estimator with the mean squared error matrix
higher than the estimator with the mean squared error matrix


Io i 


A problem with Definition 1.2.2 (more precisely, a problem inherent in the 
comparison of vector estimates rather than in this definition) is that often it 
does not allow us to say one estimator is either better or worse than the other. 
For example, consider 


fio [20 
a=]; ] and B 3 I (1.2.25) 


Clearly, neither $A \geq B$ nor $B \geq A$. In such a case one might compare the
trace and conclude that $\hat\theta$ is better than $\theta^*$ because $\mathrm{tr}\,A < \mathrm{tr}\,B$. Another
example is


_|2 1 | [20 
a-[? | and »-[? J (1.2.26) 


Again, neither $A \geq B$ nor $B \geq A$. If one were using the determinant as the
criterion, one would prefer $\hat\theta$ over $\theta^*$ because $\det A < \det B$.

Note that $B \geq A$ implies both $\mathrm{tr}\,B \geq \mathrm{tr}\,A$ and $\det B \geq \det A$. The first
follows from Theorem 7 and the second from Theorem 11 of Appendix 1. As
these two examples show, neither $\mathrm{tr}\,B \geq \mathrm{tr}\,A$ nor $\det B \geq \det A$ implies
$B \geq A$.
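
The gap between the matrix ordering and the two scalar criteria can be illustrated with a small numerical sketch. The matrices below are hypothetical values chosen for this illustration and are not the matrices of (1.2.25) or (1.2.26); the helper names are likewise assumptions. For a symmetric 2 x 2 matrix, nonnegative definiteness is equivalent to nonnegative diagonal elements together with a nonnegative determinant.

```python
# Two mean squared error matrices that cannot be ranked in the matrix sense,
# and for which the trace and determinant criteria point in opposite directions.

def sub(p, q):
    """Elementwise difference of two 2x2 matrices."""
    return [[p[i][j] - q[i][j] for j in range(2)] for i in range(2)]

def trace(m):
    return m[0][0] + m[1][1]

def det(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def is_nonneg_definite(m):
    """Valid for symmetric 2x2 matrices only."""
    return m[0][0] >= 0 and m[1][1] >= 0 and det(m) >= 0

A = [[1.0, 0.0], [0.0, 1.0]]     # hypothetical MSE matrix of one estimator
B = [[2.5, 0.0], [0.0, 0.3]]     # hypothetical MSE matrix of another estimator

b_ge_a = is_nonneg_definite(sub(B, A))   # is B >= A ?
a_ge_b = is_nonneg_definite(sub(A, B))   # is A >= B ?

trace_prefers_A = trace(A) < trace(B)    # 2.0 against 2.8
det_prefers_B = det(B) < det(A)          # 0.75 against 1.0
```

In this example neither matrix dominates the other, the trace criterion prefers the estimator with matrix `A`, and the determinant criterion prefers the one with matrix `B`, which is one way to see why both criteria should be used with caution.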

Use of the trace as a criterion has an obvious intuitive appeal, inasmuch as it
is the sum of the individual variances. Justification for the use of the determinant
involves more complicated reasoning. Suppose $\hat\theta \sim N(\theta, V)$, where $V$ is
the variance-covariance matrix of $\hat\theta$. Then, by Theorem 1 of Appendix 2,
$(\hat\theta - \theta)'V^{-1}(\hat\theta - \theta) \sim \chi^2_K$, the chi-square distribution with $K$ degrees of freedom,
$K$ being the number of elements of $\theta$. Therefore the $100(1 - \alpha)\%$ confidence
ellipsoid for $\theta$ is defined by

$$\{\theta \mid (\hat\theta - \theta)'V^{-1}(\hat\theta - \theta) < \chi^2_K(\alpha)\}, \tag{1.2.27}$$

where $\chi^2_K(\alpha)$ is the number such that $P[\chi^2_K \geq \chi^2_K(\alpha)] = \alpha$. Then the volume of
the ellipsoid (1.2.27) is proportional to the square root of the determinant of $V$
and hence increases with the determinant, as shown by Anderson (1958, p. 170).

A more intuitive justification for the determinant criterion is possible for
the case in which $\theta$ is a two-dimensional vector. Let the mean squared error
matrix of an estimator $\hat\theta = (\hat\theta_1, \hat\theta_2)'$ be

$$\begin{bmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{bmatrix}.$$

Suppose that $\theta_2$ is known; define another estimator of $\theta_1$ by
$\tilde\theta_1 = \hat\theta_1 + \alpha(\hat\theta_2 - \theta_2)$. Its mean squared error is $a_{11} + \alpha^2 a_{22} + 2\alpha a_{12}$ and attains
the minimum value of $a_{11} - (a_{12}^2/a_{22})$ when $\alpha = -a_{12}/a_{22}$. The larger
$a_{12}^2$, the more efficient can be estimation of $\theta_1$. Because the larger $a_{12}^2$ implies
the smaller determinant, the preceding reasoning provides a justification for
the determinant criterion.

Despite these justifications, both criteria inevitably suffer from a certain
degree of arbitrariness and therefore should be used with caution. Another
useful scalar criterion based on the predictive efficiency of an estimator will be
proposed in Section 1.6.


1.2.5 Least Squares as Best Linear Unbiased Estimator (BLUE) 


The class of linear estimators of $\beta$ can be defined as those estimators of the
form $C'y$ for any $T \times K$ constant matrix $C$. We can further restrict the class by
imposing the unbiasedness condition, namely,

$$EC'y = \beta \quad \text{for all } \beta. \tag{1.2.28}$$

Inserting (1.1.4) into (1.2.28), we obtain

$$C'X = I. \tag{1.2.29}$$

Clearly, the LS estimator $\hat\beta$ is a member of this class. The following theorem
proves that LS is best of all the linear unbiased estimators.


THEOREM 1.2.1 (Gauss-Markov). Let $\beta^* = C'y$ where $C$ is a $T \times K$ matrix
of constants such that $C'X = I$. Then $\hat\beta$ is better than $\beta^*$ if $\hat\beta \neq \beta^*$.


Proof. Because $\beta^* = \beta + C'u$ by (1.2.29), we have

$$\begin{aligned} V\beta^* &= EC'uu'C \\ &= \sigma^2 C'C \\ &= \sigma^2(X'X)^{-1} + \sigma^2[C' - (X'X)^{-1}X'][C' - (X'X)^{-1}X']'. \end{aligned} \tag{1.2.30}$$

The theorem follows immediately by noting that the second term of the last
line of (1.2.30) is a nonnegative definite matrix.


We shall now give an alternative proof, which contains an interesting point
of its own. The class of linear unbiased estimators can be defined alternatively
as the class of estimators of the form $(S'X)^{-1}S'y$, where $S$ is any $T \times K$ matrix
of constants such that $S'X$ is nonsingular. When it is defined this way, we call
it the class of instrumental variable estimators (abbreviated as IV) and call the
column vectors of $S$ instrumental variables. The variance-covariance matrix
of IV easily can be shown to be $\sigma^2(S'X)^{-1}S'S(X'S)^{-1}$. We get LS when we put
$S = X$, and the optimality of LS can be proved as follows: Because
$I - S(S'S)^{-1}S'$ is nonnegative definite by Theorem 14(v) of Appendix 1, we
have

$$X'X \geq X'S(S'S)^{-1}S'X. \tag{1.2.31}$$

Inverting both sides of (1.2.31) and using Theorem 17 of Appendix 1, we
obtain the desired result:

$$(X'X)^{-1} \leq (S'X)^{-1}S'S(X'S)^{-1}. \tag{1.2.32}$$
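
Inequality (1.2.32) is easiest to see when $K = 1$, where it reduces to the Cauchy-Schwarz inequality. The following is a minimal sketch under that assumption, with invented vectors `x` and `s` and an illustrative helper name; the common factor $\sigma^2$ is dropped from both variances.

```python
# Scalar case K = 1: compare the variance factor of LS, 1 / x'x, with that of
# an instrumental variable estimator, s's / (s'x)^2.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x = [1.0, 2.0, 3.0, 4.0]      # single regressor (hypothetical values)
s = [1.0, 1.0, 2.0, 5.0]      # instrument with s'x != 0 (hypothetical values)

var_ls = 1.0 / dot(x, x)                      # (X'X)^{-1}
var_iv = dot(s, s) / dot(s, x) ** 2           # (S'X)^{-1} S'S (X'S)^{-1}

# Putting S = X reproduces the LS variance factor.
var_iv_with_x = dot(x, x) / dot(x, x) ** 2
```

With these values `var_iv` is 31/841 and `var_ls` is 1/30, so the instrumental variable estimator has the larger variance, and equality is recovered when the instrument is proportional to the regressor.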


In the preceding analysis we were first given the least squares estimator and
then proceeded to prove that it is best of all the linear unbiased estimators.
Suppose now that we knew nothing about the least squares estimator and had
to find the value of $C$ that minimizes $C'C$ in the matrix sense (that is, in terms
of a ranking of matrices based on the matrix inequality defined earlier) subject
to the condition $C'X = I$. Unlike the problem of scalar minimization, calculus
is of no direct use. In such a situation it is often useful to minimize the
variance of a linear unbiased estimator of the scalar parameter $p'\beta$, where $p$ is
an arbitrary $K$-vector of known constants.

Let $c'y$ be a linear estimator of $p'\beta$. The unbiasedness condition implies
$X'c = p$. Because $Vc'y = \sigma^2 c'c$, the problem mathematically is

$$\text{Minimize } c'c \text{ subject to } X'c = p. \tag{1.2.33}$$

Define

$$S = c'c - 2\lambda'(X'c - p), \tag{1.2.34}$$

where $2\lambda$ is a $K$-vector of Lagrange multipliers. Setting the derivative of $S$ with
respect to $c$ equal to 0 yields

$$c = X\lambda. \tag{1.2.35}$$

Premultiplying both sides by $X'$ and using the constraint, we obtain

$$\lambda = (X'X)^{-1}p. \tag{1.2.36}$$

Inserting (1.2.36) into (1.2.35), we conclude that the best linear unbiased
estimator of $p'\beta$ is $p'(X'X)^{-1}X'y$. We therefore have a constructive proof of
Theorem 1.2.1.


1.3 Model 1 with Normality 


In this section we shall consider Model 1 with the added assumption of the
joint normality of $u$. Because no correlation implies independence under
normality, the $\{u_t\}$ are now assumed to be serially independent. We shall show
that the maximum likelihood estimator (MLE) of the regression parameter is
identical to the least squares estimator and, using the theory of the Cramér-Rao
lower bound, shall prove that it is the best unbiased estimator. We shall
also discuss maximum likelihood estimation of the variance $\sigma^2$.


1.3.1 Maximum Likelihood Estimator


Under Model 1 with normality we have $y \sim N(X\beta, \sigma^2 I)$. Therefore the likelihood
function is given by³

$$L = (2\pi\sigma^2)^{-T/2} \exp[-0.5\sigma^{-2}(y - X\beta)'(y - X\beta)]. \tag{1.3.1}$$

Taking the logarithm of this equation and ignoring the terms that do not
depend on the unknown parameters, we have

$$\log L = -\frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta). \tag{1.3.2}$$

Evidently the maximization of $\log L$ is equivalent to the minimization of
$(y - X\beta)'(y - X\beta)$, so we conclude that the maximum likelihood estimator of
$\beta$ is the same as the least squares estimator $\hat\beta$ obtained in Section 1.2. Putting $\hat\beta$
into (1.3.2) and setting the partial derivative of $\log L$ with respect to $\sigma^2$ equal to
0, we obtain the maximum likelihood estimator of $\sigma^2$:

$$\hat\sigma^2 = T^{-1}\hat u'\hat u, \tag{1.3.3}$$

where $\hat u = y - X\hat\beta$. This is identical to what in Section 1.2.1 we called the least
squares estimator of $\sigma^2$.

The mean and the variance-covariance matrix of $\hat\beta$ were obtained in Section
1.2.3. Because linear combinations of normal random variables are normal,
we have under the present normality assumption

$$\hat\beta \sim N[\beta, \sigma^2(X'X)^{-1}]. \tag{1.3.4}$$

The mean of $\hat\sigma^2$ is given in Eq. (1.2.17). We shall now derive its variance
under the normality assumption. We have

$$\begin{aligned} E(u'Mu)^2 &= Eu'Muu'Mu \\ &= \mathrm{tr}\{ME[(u'Mu)uu']\} \\ &= \sigma^4\,\mathrm{tr}\{M[2M + (\mathrm{tr}\,M)I]\} \\ &= \sigma^4[2\,\mathrm{tr}\,M + (\mathrm{tr}\,M)^2] \\ &= \sigma^4[2(T - K) + (T - K)^2], \end{aligned} \tag{1.3.5}$$

where we have used $Eu_t^3 = 0$ and $Eu_t^4 = 3\sigma^4$ since $u_t \sim N(0, \sigma^2)$.
The third equality in (1.3.5) can be shown as follows. If we write the $t,s$th
element of $M$ as $m_{ts}$, the $i,j$th element of the matrix $E(u'Mu)uu'$ is given by
$\sum_{t=1}^{T}\sum_{s=1}^{T} m_{ts}Eu_t u_s u_i u_j$. Hence it is equal to $2\sigma^4 m_{ij}$ if $i \neq j$ and to $2\sigma^4 m_{ii} + \sigma^4\sum_{t=1}^{T} m_{tt}$
if $i = j$, from which the third equality follows. Finally, from (1.2.17)
and (1.3.5),

$$V\hat\sigma^2 = \frac{2(T - K)\sigma^4}{T^2}. \tag{1.3.6}$$

Another important result that we shall use later in Section 1.5, where we
discuss tests of linear hypotheses, is that

$$\frac{u'Mu}{\sigma^2} \sim \chi^2_{T-K}, \tag{1.3.7}$$

which readily follows from Theorem 2 of Appendix 2 because $M$ is idempotent
with rank $T - K$. Because the variance of $\chi^2_{T-K}$ is $2(T - K)$ by Theorem 1
of Appendix 2, (1.3.6) can be derived alternatively from (1.3.7).
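
Results (1.2.17) and (1.3.6) can be checked by simulation. The following is a rough Monte Carlo sketch, assuming a hypothetical design $X = [l, x]$ with $T = 6$ and $K = 2$, an arbitrary $\sigma = 1.5$, and a fixed seed; because $\hat u = Mu$ does not depend on $\beta$, only the errors are simulated. The helper name `rss_simple` is an assumption of this sketch.

```python
import random

def rss_simple(x, u):
    """u'Mu for X = [1, x]: residual sum of squares from regressing u on X."""
    T = len(x)
    xbar, ubar = sum(x) / T, sum(u) / T
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxu = sum((xi - xbar) * (ui - ubar) for xi, ui in zip(x, u))
    suu = sum((ui - ubar) ** 2 for ui in u)
    return suu - sxu * sxu / sxx

random.seed(0)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
T, K, sigma = len(x), 2, 1.5
n_rep = 20000

# Draw normal errors repeatedly and record sigma_hat^2 = T^{-1} u'Mu.
draws = []
for _ in range(n_rep):
    u = [random.gauss(0.0, sigma) for _ in x]
    draws.append(rss_simple(x, u) / T)

mean_hat = sum(draws) / n_rep
var_hat = sum((d - mean_hat) ** 2 for d in draws) / (n_rep - 1)

mean_theory = (T - K) * sigma ** 2 / T            # from (1.2.17)
var_theory = 2 * (T - K) * sigma ** 4 / T ** 2    # from (1.3.6)
```

The simulated mean and variance should lie close to `mean_theory` and `var_theory`, with discrepancies of the order of the Monte Carlo sampling error; the variance formula relies on normality, whereas the mean formula does not.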


1.3.2 Cramér-Rao Lower Bound 


The Cramér-Rao lower bound gives a useful lower bound (in the matrix sense)
for the variance-covariance matrix of an unbiased vector estimator.⁴ In this
section we shall prove a general theorem that will be applied to Model 1 with
normality in the next two subsections.


THEOREM 1.3.1 (Cramér-Rao). Let z be an n-component vector of random 
variables (not necessarily independent) the joint density of which is given by 
L(z, 0), where 0 is a K-component vector of parameters in some parameter 
space O. Let &(z) be an unbiased estimator of 8 with a finite variance-covar- 
iance matrix. Furthermore, assume that L(z, 0) and O(z) satisfy 


alog L _ 


( E a6 


0, 
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(B log L .8logLólogL 

) 0008’ 00 06’ ' 
ð log L ð log L 

(©) 30 o9 ^» 


La, 
(D) IEL dz =I. 


Note: ð log L/@ is a K-vector, so that 0 in assumption A isa vector of K zeroes. 
Assumption C means that the left-hand side is a positive definite matrix. The 
integral in assumption D is an 7-tuple integral over the whole domain of the 
Euclidean n-space because z is an n-vector. Finally, I in assumption D is the 
identity matrix of size K. 

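Then we have for $\theta \in \Theta$

\[ V(\hat{\theta}) \ge -\left[ E\,\frac{\partial^2 \log L}{\partial \theta\,\partial \theta'} \right]^{-1}. \tag{1.3.8} \]

Proof. Define $P = E(\hat{\theta} - \theta)(\hat{\theta} - \theta)'$, $Q = E(\hat{\theta} - \theta)(\partial \log L/\partial \theta')$, and
$R = E(\partial \log L/\partial \theta)(\partial \log L/\partial \theta')$. Then

\[ \begin{bmatrix} P & Q \\ Q' & R \end{bmatrix} \ge 0 \tag{1.3.9} \]

because the left-hand side is a variance-covariance matrix. Premultiply both
sides of (1.3.9) by $[I, -QR^{-1}]$ and postmultiply by $[I, -QR^{-1}]'$. Then we get

\[ P - QR^{-1}Q' \ge 0, \tag{1.3.10} \]

where $R^{-1}$ can be defined because of assumption C. But we have

\[ \begin{aligned}
Q &= E\left[ (\hat{\theta} - \theta)\,\frac{\partial \log L}{\partial \theta'} \right] \\
  &= E\left[ \hat{\theta}\,\frac{\partial \log L}{\partial \theta'} \right] \quad \text{by assumption A} \\
  &= E\left[ \hat{\theta}\,\frac{1}{L}\frac{\partial L}{\partial \theta'} \right] \\
  &= \int \hat{\theta}\,\frac{\partial L}{\partial \theta'}\,dz \\
  &= I \quad \text{by assumption D.}
\end{aligned} \tag{1.3.11} \]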




Therefore (1.3.8) follows from (1.3.10), (1.3.11), and assumption B. 


The inverse of the lower bound, namely, $-E\,\partial^2 \log L/\partial \theta\,\partial \theta'$, is called the
information matrix. R. A. Fisher first introduced this term as a certain measure
of the information contained in the sample (see Rao, 1973, p. 329).

The Cramér-Rao lower bound may not be attained by any unbiased esti- 
mator. If it is, we have found the best unbiased estimator. In the next section 
we shall show that the maximum likelihood estimator of the regression pa- 
rameter attains the lower bound in Model 1 with normality. 

Assumptions A, B, and D seem arbitrary at first glance. We shall now 
inquire into the significance of these assumptions. Assumptions A and B are 
equivalent to 


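(A')  $\displaystyle\int \frac{\partial L}{\partial \theta}\,dz = 0$,

(B')  $\displaystyle\int \frac{\partial^2 L}{\partial \theta\,\partial \theta'}\,dz = 0$,

respectively. The equivalence between assumptions A and A' follows from

\[ E\,\frac{\partial \log L}{\partial \theta} = E\left[ \frac{1}{L}\frac{\partial L}{\partial \theta} \right] = \int \frac{\partial L}{\partial \theta}\,dz. \tag{1.3.12} \]

The equivalence between assumptions B and B' follows from

\[ \begin{aligned}
E\,\frac{\partial^2 \log L}{\partial \theta\,\partial \theta'}
  &= E\left[ -\frac{1}{L^2}\frac{\partial L}{\partial \theta}\frac{\partial L}{\partial \theta'} + \frac{1}{L}\frac{\partial^2 L}{\partial \theta\,\partial \theta'} \right] \\
  &= -E\left[ \frac{\partial \log L}{\partial \theta}\frac{\partial \log L}{\partial \theta'} \right] + \int \frac{\partial^2 L}{\partial \theta\,\partial \theta'}\,dz.
\end{aligned} \tag{1.3.13} \]

Furthermore, assumptions A', B', and D are equivalent to

(A'')  $\displaystyle\int \frac{\partial L}{\partial \theta}\,dz = \frac{\partial}{\partial \theta}\int L\,dz$,

(B'')  $\displaystyle\int \frac{\partial^2 L}{\partial \theta\,\partial \theta'}\,dz = \frac{\partial}{\partial \theta}\int \frac{\partial L}{\partial \theta'}\,dz$,

(D')  $\displaystyle\int \frac{\partial L}{\partial \theta}\,\hat{\theta}'\,dz = \frac{\partial}{\partial \theta}\int L\,\hat{\theta}'\,dz$,

because the right-hand sides of assumptions A'', B'', and D' are 0, 0, and I,
respectively. Written thus, the three assumptions are all of the form

\[ \frac{\partial}{\partial \theta}\int f(z, \theta)\,dz = \int \frac{\partial}{\partial \theta} f(z, \theta)\,dz; \tag{1.3.14} \]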


in other words, the operations of differentiation and integration can be inter- 
changed. In the next theorem we shall give sufficient conditions for (1.3.14) to 
hold in the case where $\theta$ is a scalar. The case for a vector $\theta$ can be treated in
essentially the same manner, although the notation gets more complex.


THEOREM 1.3.2. If (i) $\partial f(z, \theta)/\partial \theta$ is continuous in $\theta \in \Theta$ and z, where $\Theta$ is an
open set, (ii) $\int f(z, \theta)\,dz$ exists, and (iii) $\int |\partial f(z, \theta)/\partial \theta|\,dz < M < \infty$ for all
$\theta \in \Theta$, then (1.3.14) holds.


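Proof. We have, using assumptions (i) and (ii),

\[ \begin{aligned}
\left| \int \left[ \frac{f(z, \theta + h) - f(z, \theta)}{h} - \frac{\partial f}{\partial \theta}(z, \theta) \right] dz \right|
  &\le \int \left| \frac{f(z, \theta + h) - f(z, \theta)}{h} - \frac{\partial f}{\partial \theta}(z, \theta) \right| dz \\
  &= \int \left| \frac{\partial f}{\partial \theta}(z, \theta^*) - \frac{\partial f}{\partial \theta}(z, \theta) \right| dz,
\end{aligned} \]

where $\theta^*$ is between $\theta$ and $\theta + h$. Next, we can write the last integral as
$\int = \int_A + \int_{\bar{A}}$, where A is a sufficiently large compact set in the domain of z and
$\bar{A}$ is its complement. But we can make $\int_A$ sufficiently small because of (i) and
$\int_{\bar{A}}$ sufficiently small because of (iii).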


1.3.3 Least Squares Estimator as Best Unbiased Estimator (BUE) 


In this section we shall show that under Model 1 with normality, the least 
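squares estimator of the regression parameters $\beta$ attains the Cramér-Rao lower
bound and hence is the best unbiased estimator. Assumptions A, B, and C of
Theorem 1.3.1 are easy to verify for our model; as for assumption D, we shall
verify it only for the case where $\beta$ is a scalar and $\sigma^2$ is known, assuming that the
parameter space of $\beta$ and $\sigma^2$ is compact and does not contain $\sigma^2 = 0$.

Consider log L given in (1.3.2). In applying Theorem 1.3.1 to our model, we
put $\theta' = (\beta', \sigma^2)$. The first- and second-order derivatives of log L are given by

\[ \frac{\partial \log L}{\partial \beta} = \frac{1}{\sigma^2}(X'y - X'X\beta), \tag{1.3.15} \]

\[ \frac{\partial \log L}{\partial (\sigma^2)} = -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta), \tag{1.3.16} \]

\[ \frac{\partial^2 \log L}{\partial \beta\,\partial \beta'} = -\frac{1}{\sigma^2}X'X, \tag{1.3.17} \]

\[ \frac{\partial^2 \log L}{\partial (\sigma^2)^2} = \frac{T}{2\sigma^4} - \frac{1}{\sigma^6}(y - X\beta)'(y - X\beta), \tag{1.3.18} \]

\[ \frac{\partial^2 \log L}{\partial (\sigma^2)\,\partial \beta} = -\frac{1}{\sigma^4}(X'y - X'X\beta). \tag{1.3.19} \]

From (1.3.15) and (1.3.16) we can immediately see that assumption A is
satisfied. Taking the expectation of (1.3.17), (1.3.18), and (1.3.19), we obtain

\[ E\,\frac{\partial^2 \log L}{\partial \theta\,\partial \theta'} = -\begin{bmatrix} \dfrac{1}{\sigma^2}X'X & 0 \\ 0' & \dfrac{T}{2\sigma^4} \end{bmatrix}. \tag{1.3.20} \]

We have

\[ E\,\frac{\partial \log L}{\partial \beta}\frac{\partial \log L}{\partial \beta'} = \frac{1}{\sigma^4}EX'uu'X = \frac{1}{\sigma^2}X'X, \tag{1.3.21} \]

\[ E\left[ \frac{\partial \log L}{\partial (\sigma^2)} \right]^2 = \frac{T^2}{4\sigma^4} + \frac{1}{4\sigma^8}E(u'u)^2 - \frac{T}{2\sigma^6}Eu'u = \frac{T}{2\sigma^4}, \tag{1.3.22} \]

because $E(u'u)^2 = (T^2 + 2T)\sigma^4$, and

\[ E\,\frac{\partial \log L}{\partial \beta}\frac{\partial \log L}{\partial (\sigma^2)} = 0. \tag{1.3.23} \]

Therefore, from (1.3.20) to (1.3.23) we can see that assumptions B and C are
both satisfied.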
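
The arithmetic behind (1.3.22) can be checked mechanically. The following is a minimal sketch, assuming only the two moments $Eu'u = T\sigma^2$ and $E(u'u)^2 = (T^2 + 2T)\sigma^4$ quoted in the text; the function names are illustrative and not part of the original exposition.

```python
from fractions import Fraction


def score_sigma2_second_moment(T, sigma2):
    """Assemble E[(d log L / d sigma^2)^2] term by term as in (1.3.22)."""
    s2 = Fraction(sigma2)
    e_uu = T * s2                         # E u'u = T sigma^2
    e_uu_sq = (T * T + 2 * T) * s2 ** 2   # E (u'u)^2 = (T^2 + 2T) sigma^4
    return (Fraction(T * T, 4) / s2 ** 2
            + e_uu_sq / (4 * s2 ** 4)
            - Fraction(T, 2) * e_uu / s2 ** 3)


def information_sigma2(T, sigma2):
    """Lower-right element of the information matrix, T / (2 sigma^4)."""
    return Fraction(T, 2) / Fraction(sigma2) ** 2


# The two expressions agree exactly for any T and sigma^2 tried here.
for T in (1, 5, 40):
    for sigma2 in (Fraction(1, 2), 1, 3):
        assert score_sigma2_second_moment(T, sigma2) == information_sigma2(T, sigma2)
```

Exact rational arithmetic is used so that the agreement with $T/(2\sigma^4)$, which is what assumption B requires for the $\sigma^2$ block, is an identity rather than a floating-point approximation.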
We shall verify assumption D only for the case where $\beta$ is a scalar and $\sigma^2$ is
known so that we can use Theorem 1.3.2, which is stated for the case of a scalar
parameter. Take $\hat{\beta}L$ as the f of that theorem. We need to check only the last
condition of the theorem. Differentiating (1.3.1) with respect to $\beta$, we have

\[ \frac{\partial f}{\partial \beta} = \frac{\hat{\beta}}{\sigma^2}(x'y - \beta x'x)L, \tag{1.3.24} \]

where we have written X as x, inasmuch as it is a vector. Therefore, by Hölder's
inequality (see Royden, 1968, p. 113),

\[ \int \left| \frac{\partial f}{\partial \beta} \right| dy \le \frac{1}{\sigma^2}\left( \int \hat{\beta}^2 L\,dy \right)^{1/2}\left[ \int (x'y - \beta x'x)^2 L\,dy \right]^{1/2}. \tag{1.3.25} \]

The first integral on the right-hand side is finite because $\hat{\beta}$ is assumed to have a
finite variance. The second integral on the right-hand side is also finite because
the moments of the normal distribution are finite. Moreover, both integrals
are uniformly bounded in the assumed parameter space. Thus the last condition
of Theorem 1.3.2 is satisfied.

Finally, from (1.3.8) and (1.3.20) we have

\[ V\hat{\beta} \ge \sigma^2(X'X)^{-1} \tag{1.3.26} \]


for any unbiased $\hat{\beta}$. The right-hand side of (1.3.26) is the variance-covariance
matrix of the least squares estimator of $\beta$, thereby proving that the least
squares estimator is the best unbiased estimator under Model 1 with normality.
Unlike the result in Section 1.2.5, the result in this section is not constrained
by the linearity condition because the normality assumption was
added. Nevertheless, even with the normality assumption, there may be a
biased estimator that has a smaller average mean squared error than the least
squares estimator, as we shall show in Section 2.2. In nonnormal situations,
certain nonlinear estimators may be preferred to LS, as we shall see in
Section 2.3.


1.3.4 The Cramér-Rao Lower Bound for Unbiased Estimators of $\sigma^2$


From (1.3.8) and (1.3.20) the Cramér-Rao lower bound for unbiased estimators
of $\sigma^2$ in Model 1 with normality is equal to $2\sigma^4 T^{-1}$. We shall examine
whether it is attained by the unbiased estimator $\tilde{\sigma}^2$ defined in Eq. (1.2.18).
Using (1.2.17) and (1.3.5), we have

\[ V\tilde{\sigma}^2 = \frac{2\sigma^4}{T - K}. \tag{1.3.27} \]

Therefore it does not attain the Cramér-Rao lower bound, although the difference
is negligible when T is large.

We shall now show that there is a simple biased estimator of $\sigma^2$ that has a
smaller mean squared error than the Cramér-Rao lower bound. Define the
class of estimators

\[ \hat{\sigma}_N^2 = \frac{\hat{u}'\hat{u}}{N}, \tag{1.3.28} \]

where N is a positive integer. Both $\hat{\sigma}^2$ and $\tilde{\sigma}^2$, defined in (1.2.5) and (1.2.18),
respectively, are special cases of (1.3.28). Using (1.2.17) and (1.3.5), we can
evaluate the mean squared error of $\hat{\sigma}_N^2$ as

\[ E(\hat{\sigma}_N^2 - \sigma^2)^2 = \frac{2(T - K) + (T - K - N)^2}{N^2}\,\sigma^4. \tag{1.3.29} \]

By differentiating (1.3.29) with respect to N and equating the derivative to
zero, we can find the value of N that minimizes (1.3.29) to be

\[ N^* = T - K + 2. \tag{1.3.30} \]

Inserting (1.3.30) into (1.3.29), we have

\[ E(\hat{\sigma}_{N^*}^2 - \sigma^2)^2 = \frac{2\sigma^4}{T - K + 2}, \tag{1.3.31} \]

which is smaller than the Cramér-Rao bound if K = 1.
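
The minimization leading to (1.3.30) can be confirmed by brute force. The sketch below is illustrative only: it evaluates the mean squared error formula (1.3.29) over a grid of integer divisors N (the grid bound `3 * T` is an arbitrary assumption) and compares the minimum with the Cramér-Rao bound $2\sigma^4/T$.

```python
from fractions import Fraction


def mse_sigma2_N(T, K, N, sigma2=1):
    """Mean squared error of u_hat'u_hat / N, as in (1.3.29)."""
    m = T - K
    return Fraction(2 * m + (m - N) ** 2, N * N) * Fraction(sigma2) ** 2


def best_N(T, K):
    """Search positive integers up to 3T for the MSE-minimizing divisor."""
    return min(range(1, 3 * T + 1), key=lambda N: mse_sigma2_N(T, K, N))


T, K = 20, 1
cramer_rao = Fraction(2, T)                 # 2 sigma^4 / T with sigma^2 = 1
print(best_N(T, K))                         # equals T - K + 2
print(mse_sigma2_N(T, K, T - K + 2), cramer_rao)
```

With T = 20 and K = 1 the search returns N = 21, and the attained mean squared error 2/21 lies below the bound 2/20, in line with the remark following (1.3.31). Setting N = T - K reproduces the variance in (1.3.27).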


1.4 Model 1 with Linear Constraints 


In this section we shall consider estimation of the parameters $\beta$ and $\sigma^2$ in
Model 1 when there are certain linear constraints on the elements of $\beta$. We
shall assume that the constraints are of the form

\[ Q'\beta = c, \tag{1.4.1} \]

where Q is a $K \times q$ matrix of known constants and c is a q-vector of known
constants. We shall also assume $q < K$ and rank(Q) = q.

Equation (1.4.1) embodies many of the common constraints that occur in
practice. For example, if $Q' = (I, 0)$ where I is the identity matrix of size $K_1$
and 0 is the $K_1 \times K_2$ matrix of zeroes such that $K_1 + K_2 = K$, then the
constraints mean that the elements of a $K_1$-component subset of $\beta$ are specified to
be equal to certain values and the remaining $K_2$ elements are allowed to vary
freely. As another example, the case in which $Q'$ is a row vector of ones and
c = 1 implies the restriction that the sum of the regression parameters is unity.

The study of this subject is useful for its own sake in addition to providing
preliminary results for the next section, where we shall discuss tests of the
linear hypothesis (1.4.1). We shall define the constrained least squares estimator,
present an alternative derivation, show it is BLUE when (1.4.1) is true,
and finally consider the case where (1.4.1) is made stochastic by the addition of
a random error term to the right-hand side.


1.4.1 Constrained Least Squares Estimator (CLS)


The constrained least squares estimator (CLS) of $\beta$, denoted by $\bar{\beta}$, is defined to
be the value of $\beta$ that minimizes the sum of squared residuals

\[ S(\beta) = (y - X\beta)'(y - X\beta) \tag{1.4.2} \]

under the constraints (1.4.1). In Section 1.2.1 we showed that (1.4.2) is
minimized without constraint at the least squares estimator $\hat{\beta}$. Writing $S(\hat{\beta})$ for the
sum of squares of the least squares residuals, we can rewrite (1.4.2) as

\[ S(\beta) = S(\hat{\beta}) + (\beta - \hat{\beta})'X'X(\beta - \hat{\beta}). \tag{1.4.3} \]

Instead of directly minimizing (1.4.2) under (1.4.1), we minimize (1.4.3)
under (1.4.1), which is mathematically simpler.

Put $\hat{\beta} - \beta = \delta$ and $Q'\hat{\beta} - c = \gamma$. Then, because $S(\hat{\beta})$ does not depend on $\beta$,
the problem is equivalent to the minimization of $\delta'X'X\delta$ under $Q'\delta = \gamma$.
Equating the derivatives of $\delta'X'X\delta + 2\lambda'(Q'\delta - \gamma)$ with respect to $\delta$ and the
q-vector of Lagrange multipliers $\lambda$ to zero, we obtain the solution

\[ \delta = (X'X)^{-1}Q[Q'(X'X)^{-1}Q]^{-1}\gamma. \tag{1.4.4} \]

Transforming from $\delta$ and $\gamma$ to the original variables, we can write the
minimizing value $\bar{\beta}$ of $S(\beta)$ as

\[ \bar{\beta} = \hat{\beta} - (X'X)^{-1}Q[Q'(X'X)^{-1}Q]^{-1}(Q'\hat{\beta} - c). \tag{1.4.5} \]

The corresponding estimator of $\sigma^2$ can be defined as

\[ \bar{\sigma}^2 = T^{-1}(y - X\bar{\beta})'(y - X\bar{\beta}). \tag{1.4.6} \]

It is easy to show that $\bar{\beta}$ and $\bar{\sigma}^2$ are the constrained maximum likelihood
estimators if we assume normality of u in Model 1.
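
The correction formula (1.4.5) can be illustrated on a small example. The sketch below is hypothetical throughout: the data, the constraint $\beta_1 + \beta_2 = 1$, and the helper names are assumptions made for illustration, and the code is restricted to K = 2 regressors and q = 1 constraint so that the explicit 2 x 2 inverse suffices.

```python
from fractions import Fraction as F


def inv2(a):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (p, q), (r, s) = a
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]


def matvec(a, v):
    """Matrix-vector product for nested lists."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def ols_and_cls(X, y, Q, c):
    """Return (beta_hat, beta_bar) for K = 2 regressors and q = 1 constraint."""
    XtX = [[sum(F(r[i]) * r[j] for r in X) for j in range(2)] for i in range(2)]
    Xty = [sum(F(r[i]) * yt for r, yt in zip(X, y)) for i in range(2)]
    XtX_inv = inv2(XtX)
    beta_hat = matvec(XtX_inv, Xty)                        # unconstrained LS
    w = matvec(XtX_inv, Q)                                 # (X'X)^{-1} Q
    gap = sum(qi * b for qi, b in zip(Q, beta_hat)) - c    # Q'beta_hat - c
    scale = gap / sum(qi * wi for qi, wi in zip(Q, w))     # [Q'(X'X)^{-1}Q]^{-1} gap
    beta_bar = [b - wi * scale for b, wi in zip(beta_hat, w)]   # (1.4.5)
    return beta_hat, beta_bar


X = [[1, 0], [1, 1], [1, 2], [1, 3]]    # constant and one regressor (made-up data)
y = [1, 3, 2, 5]
beta_hat, beta_bar = ols_and_cls(X, y, Q=[1, 1], c=1)
print(beta_hat, beta_bar, sum(beta_bar))
```

For these numbers the unconstrained estimate is (11/10, 11/10), which violates the constraint, while the corrected estimate (-1/2, 3/2) satisfies it exactly.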


1.4.2 An Alternative Derivation of the Constrained Least 
Squares Estimator 


Define a $K \times (K - q)$ matrix R such that the matrix (Q, R) is nonsingular and
$R'Q = 0$. Such a matrix can always be found; it is not unique, and any matrix
that satisfies these conditions will do. Then, defining $A = (Q, R)'$, $\gamma = A\beta$,
and $Z = XA^{-1}$, we can rewrite the basic model (1.1.4) as

\[ \begin{aligned}
y &= X\beta + u \\
  &= XA^{-1}A\beta + u \\
  &= Z\gamma + u.
\end{aligned} \tag{1.4.7} \]

If we partition $\gamma = (\gamma_1', \gamma_2')'$, where $\gamma_1 = Q'\beta$ and $\gamma_2 = R'\beta$, we see that the
constraints (1.4.1) specify $\gamma_1$ and leave $\gamma_2$ unspecified. The vector $\gamma_2$ has $K - q$
elements; thus we have reduced the problem of estimating K parameters
subject to q constraints to the problem of estimating $K - q$ free parameters.
Using $\gamma_1 = c$ and

\[ A^{-1} = [Q(Q'Q)^{-1}, R(R'R)^{-1}], \tag{1.4.8} \]

we have from (1.4.7)

\[ y - XQ(Q'Q)^{-1}c = XR(R'R)^{-1}\gamma_2 + u. \tag{1.4.9} \]

Let $\hat{\gamma}_2$ be the least squares estimator of $\gamma_2$ in (1.4.9):

\[ \hat{\gamma}_2 = R'R(R'X'XR)^{-1}R'X'[y - XQ(Q'Q)^{-1}c]. \tag{1.4.10} \]

Now, transforming from $\gamma$ back to $\beta$ by the relationship $\beta = A^{-1}\gamma$, we obtain
the CLS estimator of $\beta$:

\[ \bar{\beta} = R(R'X'XR)^{-1}R'X'y + [I - R(R'X'XR)^{-1}R'X'X]Q(Q'Q)^{-1}c. \tag{1.4.11} \]

Note that (1.4.11) is different from (1.4.5). Equation (1.4.5) is valid only if
$X'X$ is nonsingular, whereas (1.4.11) can be defined even if $X'X$ is singular
provided that $R'X'XR$ is nonsingular. We can show that if $X'X$ is nonsingular,
(1.4.11) is unique and equal to (1.4.5). Denote the right-hand sides of
(1.4.5) and (1.4.11) by $\bar{\beta}_1$ and $\bar{\beta}_2$, respectively. Then it is easy to show

\[ \begin{bmatrix} R'X'X \\ Q' \end{bmatrix}(\bar{\beta}_1 - \bar{\beta}_2) = 0. \tag{1.4.12} \]

Therefore $\bar{\beta}_1 = \bar{\beta}_2$ if the matrix in the square bracket above is nonsingular. But
we have

\[ \begin{bmatrix} R'X'X \\ Q' \end{bmatrix}[R, Q] = \begin{bmatrix} R'X'XR & R'X'XQ \\ 0 & Q'Q \end{bmatrix}, \tag{1.4.13} \]

where the matrix on the right-hand side is clearly nonsingular because
nonsingularity of $X'X$ implies nonsingularity of $R'X'XR$. Because the matrix
[R, Q] is nonsingular, it follows that the matrix

\[ \begin{bmatrix} R'X'X \\ Q' \end{bmatrix} \]

is nonsingular, as we desired.
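
The reparameterization route (1.4.9)-(1.4.11) can be sketched in the same spirit. The example below is a hypothetical illustration for K = 2 and q = 1, so that $\gamma_2$ is a scalar and (1.4.9) is a one-regressor least squares problem; the data and the two choices of R are assumptions made for the example.

```python
from fractions import Fraction as F


def cls_by_reparameterization(X, y, Q, R, c):
    """CLS for K = 2, q = 1 via (1.4.9)-(1.4.11): one free parameter gamma_2."""
    qq = sum(F(v) * v for v in Q)                  # Q'Q (scalar since q = 1)
    rr = sum(F(v) * v for v in R)                  # R'R (scalar since K - q = 1)
    base = [v * c / qq for v in Q]                 # Q (Q'Q)^{-1} c
    # Regressor of (1.4.9): X R (R'R)^{-1}, one value per observation.
    z = [sum(xi * ri for xi, ri in zip(row, R)) / rr for row in X]
    # Left-hand side of (1.4.9): y - X Q (Q'Q)^{-1} c.
    lhs = [yt - sum(xi * bi for xi, bi in zip(row, base)) for row, yt in zip(X, y)]
    gamma2 = sum(zt * lt for zt, lt in zip(z, lhs)) / sum(zt * zt for zt in z)
    # Back-transform: beta = Q (Q'Q)^{-1} c + R (R'R)^{-1} gamma_2.
    return [bi + ri * gamma2 / rr for bi, ri in zip(base, R)]


X = [[1, 0], [1, 1], [1, 2], [1, 3]]    # made-up data
y = [1, 3, 2, 5]
Q, c = [1, 1], 1                        # constraint beta_1 + beta_2 = 1

print(cls_by_reparameterization(X, y, Q, [1, -1], c))
print(cls_by_reparameterization(X, y, Q, [2, -2], c))   # a different valid R
```

Both calls give (-1/2, 3/2), which is also what formula (1.4.5) yields for these data, illustrating that the result does not depend on which R orthogonal to Q is chosen.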


1.4.3 Constrained Least Squares Estimator as Best Linear
Unbiased Estimator


That $\bar{\beta}$ is the best linear unbiased estimator follows from the fact that $\hat{\gamma}_2$ is the
best linear unbiased estimator of $\gamma_2$ in (1.4.9); however, we also can prove it
directly. Inserting (1.4.9) into (1.4.11) and using (1.4.8), we obtain

\[ \bar{\beta} = \beta + R(R'X'XR)^{-1}R'X'u. \tag{1.4.14} \]

Therefore, $\bar{\beta}$ is unbiased and its variance-covariance matrix is given by

\[ V(\bar{\beta}) = \sigma^2 R(R'X'XR)^{-1}R'. \tag{1.4.15} \]

We shall now define the class of linear estimators by $\beta^* = C'y - d$ where $C'$ is
a $K \times T$ matrix and d is a K-vector. This class is broader than the class of linear
estimators considered in Section 1.2.5 because of the additive constants d. We
did not include d previously because in the unconstrained model the
unbiasedness condition would ensure d = 0. Here, the unbiasedness condition
$E(C'y - d) = \beta$ implies $C'X = I + GQ'$ and $d = Gc$ for some arbitrary $K \times q$
matrix G. We have $V(\beta^*) = \sigma^2 C'C$ as in Eq. (1.2.30) and CLS is BLUE
because of the identity

\[ C'C - R(R'X'XR)^{-1}R' = [C' - R(R'X'XR)^{-1}R'X'][C' - R(R'X'XR)^{-1}R'X']', \tag{1.4.16} \]

where we have used $C'X = I + GQ'$ and $R'Q = 0$.


1.4.4 Stochastic Constraints 


Suppose we add a stochastic error term on the right-hand side of (1.4.1), 
namely, 


\[ Q'\beta = c + v, \tag{1.4.17} \]

where v is a q-vector of random variables such that $Ev = 0$ and $Evv' = \tau^2 I$. By
making the constraints stochastic we have departed from the domain of classical
statistics and entered that of Bayesian statistics, which treats the unknown
parameters $\beta$ as random variables. Although we generally adopt the
classical viewpoint in this book, we shall occasionally use Bayesian analysis
whenever we believe it sheds light on the problem at hand.⁵

In terms of the parameterization of (1.4.7), the constraints (1.4.17) are
equivalent to

\[ \gamma_1 = c + v. \tag{1.4.18} \]

We shall first derive the posterior distribution of $\gamma$, using a prior distribution
over all the elements of $\gamma$, because it is mathematically simpler to do so; and
then we shall treat (1.4.18) as what obtains in the limit as the variances of the
prior distribution for $\gamma_2$ go to infinity. We shall assume $\sigma^2$ is known, for this
assumption makes the algebra considerably simpler without changing the
essentials. For a discussion of the case where a prior distribution is assumed on
$\sigma^2$ as well, see Zellner (1971, p. 65) or Theil (1971, p. 670).
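Let the prior density of $\gamma$ be

\[ f(\gamma) = (2\pi)^{-K/2}|\Omega|^{-1/2}\exp[-(1/2)(\gamma - \mu)'\Omega^{-1}(\gamma - \mu)], \tag{1.4.19} \]

where $\Omega$ is a known variance-covariance matrix. Thus, by Bayes's rule, the
posterior density of $\gamma$ given y is

\[ \begin{aligned}
f(\gamma \mid y) &= \frac{f(y \mid \gamma)f(\gamma)}{\int f(y \mid \gamma)f(\gamma)\,d\gamma} \\
  &= c_1\exp\{-(1/2)[\sigma^{-2}(y - Z\gamma)'(y - Z\gamma) + (\gamma - \mu)'\Omega^{-1}(\gamma - \mu)]\},
\end{aligned} \tag{1.4.20} \]

where $c_1$ does not depend on $\gamma$. Rearranging the terms inside the bracket, we
have

\[ \begin{aligned}
&\sigma^{-2}(y - Z\gamma)'(y - Z\gamma) + (\gamma - \mu)'\Omega^{-1}(\gamma - \mu) \\
&\quad = \gamma'(\sigma^{-2}Z'Z + \Omega^{-1})\gamma - 2(\sigma^{-2}y'Z + \mu'\Omega^{-1})\gamma + \sigma^{-2}y'y + \mu'\Omega^{-1}\mu \\
&\quad = (\gamma - \bar{\gamma})'(\sigma^{-2}Z'Z + \Omega^{-1})(\gamma - \bar{\gamma}) - \bar{\gamma}'(\sigma^{-2}Z'Z + \Omega^{-1})\bar{\gamma} + \sigma^{-2}y'y + \mu'\Omega^{-1}\mu,
\end{aligned} \tag{1.4.21} \]

where

\[ \bar{\gamma} = (\sigma^{-2}Z'Z + \Omega^{-1})^{-1}(\sigma^{-2}Z'y + \Omega^{-1}\mu). \tag{1.4.22} \]

Therefore the posterior distribution of $\gamma$ is

\[ \gamma \mid y \sim N(\bar{\gamma}, [\sigma^{-2}Z'Z + \Omega^{-1}]^{-1}), \tag{1.4.23} \]

and the Bayes estimator of $\gamma$ is the posterior mean $\bar{\gamma}$ given in (1.4.22). Because
$\gamma = A\beta$, the Bayes estimator of $\beta$ is given by

\[ \begin{aligned}
\tilde{\beta} &= A^{-1}[\sigma^{-2}(A')^{-1}X'XA^{-1} + \Omega^{-1}]^{-1}[\sigma^{-2}(A')^{-1}X'y + \Omega^{-1}\mu] \\
  &= (\sigma^{-2}X'X + A'\Omega^{-1}A)^{-1}(\sigma^{-2}X'y + A'\Omega^{-1}\mu).
\end{aligned} \tag{1.4.24} \]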


We shall now specify $\mu$ and $\Omega$ so that they conform to the stochastic
constraints (1.4.18). This can be done by putting the first q elements of $\mu$ equal to c
(leaving the remaining $K - q$ elements unspecified because their values do not
matter in the limit we shall consider later), putting

\[ \Omega = \begin{bmatrix} \tau^2 I & 0 \\ 0 & \nu^2 I \end{bmatrix}, \tag{1.4.25} \]

and then taking $\nu^2$ to infinity (which expresses the assumption that nothing is a
priori known about $\gamma_2$). Hence, in the limit we have

\[ \lim_{\nu \to \infty}\Omega^{-1} = \begin{bmatrix} \tau^{-2}I & 0 \\ 0 & 0 \end{bmatrix}. \tag{1.4.26} \]

Inserting (1.4.26) into (1.4.24) and writing the first q elements of $\mu$ as c, we
finally obtain

\[ \tilde{\beta} = (X'X + \lambda^2 QQ')^{-1}(X'y + \lambda^2 Qc), \tag{1.4.27} \]

where $\lambda^2 = \sigma^2/\tau^2$.

We have obtained the estimator $\tilde{\beta}$ as a special case of the Bayes estimator,
but this estimator was originally proposed by Theil and Goldberger (1961) and
was called the mixed estimator on heuristic grounds. In their heuristic approach,
Eqs. (1.1.4) and (1.4.17) are combined to yield a system of equations

\[ \begin{bmatrix} y \\ \lambda c \end{bmatrix} = \begin{bmatrix} X \\ \lambda Q' \end{bmatrix}\beta + \begin{bmatrix} u \\ -\lambda v \end{bmatrix}. \tag{1.4.28} \]

Note that the multiplication of the second part of the equations by $\lambda$ renders
the combined error terms homoscedastic (that is, constant variance) so that
(1.4.28) satisfies the assumptions of Model 1. Then Theil and Goldberger
proposed the application of the least squares estimator to (1.4.28), an operation
that yields the same estimator as $\tilde{\beta}$ given in (1.4.27). An alternative way to
interpret this estimator as a Bayes estimator is given in Theil (1971, p. 670).
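
That least squares on the stacked system (1.4.28) reproduces formula (1.4.27) can be checked numerically. The following is a minimal sketch under stated assumptions: K = 2, q = 1, made-up data, and hypothetical helper names `ols2`, `mixed_direct`, and `mixed_stacked`.

```python
from fractions import Fraction as F


def ols2(X, y):
    """Least squares with exactly two regressors via the 2 x 2 normal equations."""
    a = sum(F(r[0]) * r[0] for r in X)
    b = sum(F(r[0]) * r[1] for r in X)
    d = sum(F(r[1]) * r[1] for r in X)
    g0 = sum(F(r[0]) * t for r, t in zip(X, y))
    g1 = sum(F(r[1]) * t for r, t in zip(X, y))
    det = a * d - b * b
    return [(d * g0 - b * g1) / det, (a * g1 - b * g0) / det]


def mixed_direct(X, y, Q, c, lam2):
    """Formula (1.4.27): (X'X + lam2 QQ')^{-1}(X'y + lam2 Q c), for K = 2, q = 1."""
    a = sum(F(r[0]) * r[0] for r in X) + lam2 * Q[0] * Q[0]
    b = sum(F(r[0]) * r[1] for r in X) + lam2 * Q[0] * Q[1]
    d = sum(F(r[1]) * r[1] for r in X) + lam2 * Q[1] * Q[1]
    g0 = sum(F(r[0]) * t for r, t in zip(X, y)) + lam2 * Q[0] * c
    g1 = sum(F(r[1]) * t for r, t in zip(X, y)) + lam2 * Q[1] * c
    det = a * d - b * b
    return [(d * g0 - b * g1) / det, (a * g1 - b * g0) / det]


def mixed_stacked(X, y, Q, c, lam):
    """Least squares applied to the stacked system (1.4.28)."""
    return ols2(X + [[lam * qi for qi in Q]], y + [lam * c])


X = [[1, 0], [1, 1], [1, 2], [1, 3]]    # made-up data
y = [1, 3, 2, 5]
Q, c, lam = [1, 1], 1, 2                # lam^2 = sigma^2 / tau^2 = 4 (assumed)
print(mixed_direct(X, y, Q, c, lam * lam), mixed_stacked(X, y, Q, c, lam))
```

Both routes give (5/22, 29/22). The stacked form is convenient in practice because any ordinary least squares routine can be reused once the constraint row, scaled by $\lambda$, is appended to the data.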

There is an interesting connection between the Bayes estimator (1.4.27) and
the constrained least squares estimator (1.4.11): The latter is obtained as the
limit of the former, taking $\lambda^2$ to infinity. Note that this result is consistent with
our intuition inasmuch as $\lambda^2 \to \infty$ is equivalent to $\tau^2 \to 0$, an equivalency that
implies that the stochastic element disappears from the constraints (1.4.17),
thereby reducing them to the nonstochastic constraints (1.4.1). We shall
demonstrate this below. [Note that the limit of (1.4.27) as $\lambda^2 \to \infty$ is not
$(QQ')^{-1}Qc$ because $QQ'$ is singular; rather, $QQ'$ is a $K \times K$ matrix with
rank $q < K$.]

Define

\[ \begin{aligned}
B &= A(\lambda^{-2}X'X + QQ')A' \\
  &= \begin{bmatrix} \lambda^{-2}Q'X'XQ + Q'QQ'Q & \lambda^{-2}Q'X'XR \\ \lambda^{-2}R'X'XQ & \lambda^{-2}R'X'XR \end{bmatrix}.
\end{aligned} \tag{1.4.29} \]

Then, by Theorem 13 of Appendix 1, we have

\[ B^{-1} = \begin{bmatrix} E^{-1} & -E^{-1}Q'X'XR(R'X'XR)^{-1} \\ -(R'X'XR)^{-1}R'X'XQE^{-1} & F^{-1} \end{bmatrix}, \tag{1.4.30} \]

where

\[ E = Q'QQ'Q + \lambda^{-2}Q'X'XQ - \lambda^{-2}Q'X'XR(R'X'XR)^{-1}R'X'XQ \tag{1.4.31} \]

and

\[ F = \lambda^{-2}R'X'XR - \lambda^{-4}R'X'XQ(\lambda^{-2}Q'X'XQ + Q'QQ'Q)^{-1}Q'X'XR. \tag{1.4.32} \]

From (1.4.27) and (1.4.29) we have

\[ \tilde{\beta} = A'B^{-1}A(\lambda^{-2}X'y + Qc). \tag{1.4.33} \]

Using (1.4.30), we have

\[ \lim_{\lambda^2 \to \infty} A'B^{-1}A(\lambda^{-2}X'y) = R(R'X'XR)^{-1}R'X'y \tag{1.4.34} \]

and

\[ \begin{aligned}
\lim_{\lambda^2 \to \infty} A'B^{-1}AQc &= \lim_{\lambda^2 \to \infty}(Q, R)B^{-1}\begin{bmatrix} Q'Q \\ 0 \end{bmatrix}c \\
  &= Q(Q'Q)^{-1}c - R(R'X'XR)^{-1}R'X'XQ(Q'Q)^{-1}c.
\end{aligned} \tag{1.4.35} \]

Thus we have proved

\[ \lim_{\lambda^2 \to \infty}\tilde{\beta} = \bar{\beta}, \tag{1.4.36} \]

where $\bar{\beta}$ is given in (1.4.11).
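
The limit (1.4.36) can be watched numerically. The sketch below is an illustration only, for K = 2 and q = 1 with made-up data; the vector `cls` holds the constrained least squares estimate for the same data and constraint, which works out to (-1/2, 3/2) from (1.4.5).

```python
from fractions import Fraction as F


def mixed(lam2):
    """Mixed estimator (1.4.27) for fixed illustrative data, K = 2, q = 1."""
    X = [[1, 0], [1, 1], [1, 2], [1, 3]]
    y = [1, 3, 2, 5]
    Q, c = [1, 1], 1
    lam2 = F(lam2)
    # m = X'X + lam2 QQ', g = X'y + lam2 Q c
    m = [[sum(r[i] * r[j] for r in X) + lam2 * Q[i] * Q[j] for j in range(2)]
         for i in range(2)]
    g = [sum(r[i] * t for r, t in zip(X, y)) + lam2 * Q[i] * c for i in range(2)]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(m[1][1] * g[0] - m[0][1] * g[1]) / det,
            (m[0][0] * g[1] - m[1][0] * g[0]) / det]


cls = [F(-1, 2), F(3, 2)]      # CLS for the same data and constraint
for lam2 in (0, 1, 100, 10 ** 6):
    est = mixed(lam2)
    worst = max(abs(e - b) for e, b in zip(est, cls))
    print(lam2, [float(e) for e in est], float(worst))
```

At $\lambda^2 = 0$ the estimator coincides with unconstrained least squares; as $\lambda^2$ grows the largest coordinate gap to the CLS estimate shrinks at the rate $1/\lambda^2$ for these data.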
1.5 Test of Linear Hypotheses


In this section we shall regard the linear constraints (1.4.1) as a testable
hypothesis, calling it the null hypothesis. Throughout the section we shall
assume Model 1 with normality because the distributions of the commonly used
test statistics are derived under the assumption of normality. We shall discuss
the t test, the F test, and a test of structural change (a special case of the F test).


1.5.1 The t Test


The t test is an ideal test to use when we have a single constraint, that is, q = 1.
The F test, which will be discussed in the next section, will be used if q > 1.
Because $\hat{\beta}$ is normal, as shown in Eq. (1.3.4), we have

\[ Q'\hat{\beta} \sim N[c, \sigma^2 Q'(X'X)^{-1}Q] \tag{1.5.1} \]

under the null hypothesis (that is, if $Q'\beta = c$). With q = 1, $Q'$ is a row vector
and c is a scalar. Therefore

\[ \frac{Q'\hat{\beta} - c}{[\sigma^2 Q'(X'X)^{-1}Q]^{1/2}} \sim N(0, 1). \tag{1.5.2} \]

This is the test statistic one would use if $\sigma$ were known. As we have shown in
Eq. (1.3.7), we have

\[ \frac{\hat{u}'\hat{u}}{\sigma^2} \sim \chi^2_{T-K}. \tag{1.5.3} \]

The random variables (1.5.2) and (1.5.3) easily can be shown to be independent
by using Theorem 6 of Appendix 2 or by noting $E\hat{u}\hat{\beta}' = 0$, which implies
that $\hat{u}$ and $\hat{\beta}$ are independent because they are normal, which in turn implies
that $\hat{u}'\hat{u}$ and $\hat{\beta}$ are independent. Hence, by Theorem 3 of Appendix 2 we have

\[ \frac{Q'\hat{\beta} - c}{[\tilde{\sigma}^2 Q'(X'X)^{-1}Q]^{1/2}} \sim S_{T-K}, \tag{1.5.4} \]

which is Student's t with $T - K$ degrees of freedom, where $\tilde{\sigma}$ is the square root
of the unbiased estimator of $\sigma^2$ defined in Eq. (1.2.18). Note that the denominator
in (1.5.4) is an estimate of the standard deviation of the numerator. Thus
the null hypothesis $Q'\beta = c$ can be tested by the statistic (1.5.4). We can use a
one-tail or two-tail test, depending on the alternative hypothesis.

In Chapter 3 we shall show that even if u is not normal, (1.5.2) holds
asymptotically (that is, approximately when T is large) under general conditions.
We shall also show that $\tilde{\sigma}^2$ converges to $\sigma^2$ in probability as T goes to
infinity (the exact definition will be given there). Therefore under general
distributions of u the statistic defined in (1.5.4) is asymptotically distributed as
N(0, 1) and can be used to test the hypothesis using the standard normal
table.⁶ In this case $\hat{\sigma}^2$ may be used in place of $\tilde{\sigma}^2$ because $\hat{\sigma}^2$ also converges to $\sigma^2$
in probability.


1.5.2 The F Test


In this section we shall consider the test of the null hypothesis $Q'\beta = c$ against
the alternative hypothesis $Q'\beta \ne c$ when it involves more than one constraint
(that is, q > 1).

We shall derive the F test as a simple transformation of the likelihood ratio
test. Suppose that the likelihood function is in general given by $L(x, \theta)$, where
x is a sample and $\theta$ is a vector of parameters. Let the null hypothesis be $H_0$:
$\theta \in S_0$, where $S_0$ is a subset of the parameter space $\Theta$, and let the alternative
hypothesis be $H_1$: $\theta \in S_1$, where $S_1$ is another subset of $\Theta$. Then the likelihood
ratio test is defined by the following procedure:

\[ \text{Reject } H_0 \text{ if } \lambda = \frac{\max_{\theta \in S_0} L(x, \theta)}{\max_{\theta \in S_0 \cup S_1} L(x, \theta)} < g, \tag{1.5.5} \]

where g is chosen so as to satisfy $P(\lambda < g) = \alpha$ for a given significance level $\alpha$.
The likelihood ratio test may be justified on several grounds: (1) It is intuitively
appealing because of its connection with the Neyman-Pearson lemma.
(2) It is known to lead to good tests in many cases. (3) It has asymptotically
optimal properties, as shown by Wald (1943).

Now let us obtain the likelihood ratio test of $Q'\beta = c$ against $Q'\beta \ne c$ in
Model 1 with normality. When we use the results of Section 1.4.1, the
numerator likelihood becomes

\[ \begin{aligned}
\max_{Q'\beta = c} L &= (2\pi\bar{\sigma}^2)^{-T/2}\exp[-0.5\,\bar{\sigma}^{-2}(y - X\bar{\beta})'(y - X\bar{\beta})] \\
  &= (2\pi\bar{\sigma}^2)^{-T/2}e^{-T/2}.
\end{aligned} \tag{1.5.6} \]

Because in the present case $S_0 \cup S_1$ is the whole parameter space of $\beta$, the
maximization in the denominator of (1.5.5) is carried out without constraint.
Therefore the denominator likelihood is given by

\[ \begin{aligned}
\max L &= (2\pi\hat{\sigma}^2)^{-T/2}\exp[-0.5\,\hat{\sigma}^{-2}(y - X\hat{\beta})'(y - X\hat{\beta})] \\
  &= (2\pi\hat{\sigma}^2)^{-T/2}e^{-T/2}.
\end{aligned} \tag{1.5.7} \]

Hence, the likelihood ratio test of $Q'\beta = c$ is

\[ \text{Reject } Q'\beta = c \text{ if } \lambda = \left[ \frac{S(\bar{\beta})}{S(\hat{\beta})} \right]^{-T/2} < g, \tag{1.5.8} \]

where g is chosen as in (1.5.5).
We actually use the equivalent test

\[ \text{Reject } Q'\beta = c \text{ if } \eta = \frac{T - K}{q}\cdot\frac{S(\bar{\beta}) - S(\hat{\beta})}{S(\hat{\beta})} > h, \tag{1.5.9} \]

where h is appropriately chosen. Clearly, (1.5.8) and (1.5.9) can be made
equivalent by putting $h = [(T - K)/q](g^{-2/T} - 1)$. The test (1.5.9) is called the
F test because under the null hypothesis $\eta$ is distributed as $F(q, T - K)$, the
F distribution with q and $T - K$ degrees of freedom, as we shall show next.
From (1.4.3) and (1.4.5) we have

\[ S(\bar{\beta}) - S(\hat{\beta}) = (Q'\hat{\beta} - c)'[Q'(X'X)^{-1}Q]^{-1}(Q'\hat{\beta} - c). \tag{1.5.10} \]

But since (1.5.1) holds even if Q is a matrix, we have by Theorem 1 of
Appendix 2

\[ \frac{S(\bar{\beta}) - S(\hat{\beta})}{\sigma^2} \sim \chi^2_q. \tag{1.5.11} \]

Because this chi-square variable is independent of the chi-square variable
given in (1.5.3) by an application of Theorem 5 of Appendix 2 (or by the
independence of $\hat{u}$ and $\hat{\beta}$), we have by Theorem 4 of Appendix 2

\[ \eta = \frac{T - K}{q}\cdot\frac{(Q'\hat{\beta} - c)'[Q'(X'X)^{-1}Q]^{-1}(Q'\hat{\beta} - c)}{\hat{u}'\hat{u}} \sim F(q, T - K). \tag{1.5.12} \]


We shall now give an alternative motivation for the test statistic (1.5.12).
For this purpose, consider the general problem of testing the null hypothesis
$H_0$: $\theta = \theta_0$, where $\theta = (\theta_1, \theta_2)'$ is a two-dimensional vector of parameters.
Suppose we wish to construct a test statistic on the basis of an estimator $\hat{\theta}$ that
is distributed as $N(\theta_0, V)$ under the null hypothesis, where we assume V to be a
known diagonal matrix for simplicity. Consider the following two tests of $H_0$:

\[ \text{Reject } H_0 \text{ if } (\hat{\theta} - \theta_0)'(\hat{\theta} - \theta_0) > c \tag{1.5.13} \]

and

\[ \text{Reject } H_0 \text{ if } (\hat{\theta} - \theta_0)'V^{-1}(\hat{\theta} - \theta_0) > d, \tag{1.5.14} \]

where c and d are determined so as to make the probability of Type I error
equal to the same given value for each test. That (1.5.14) is preferred to
(1.5.13) can be argued as follows: Let $\sigma_1^2$ and $\sigma_2^2$ be the variances of $\hat{\theta}_1$ and $\hat{\theta}_2$,
respectively, and suppose $\sigma_1^2 < \sigma_2^2$. Then a deviation of $\hat{\theta}_1$ from $\theta_{10}$ provides a
greater reason for rejecting $H_0$ than the same size deviation of $\hat{\theta}_2$ from $\theta_{20}$
because the latter could be caused by variability of $\hat{\theta}_2$ rather than by falseness
of $H_0$. The test statistic (1.5.14) precisely incorporates this consideration.

The test (1.5.14) is commonly used for the general case of a q-vector $\theta$ and a
general variance-covariance matrix V. Another attractive feature of the test is
that

\[ (\hat{\theta} - \theta_0)'V^{-1}(\hat{\theta} - \theta_0) \sim \chi^2_q. \tag{1.5.15} \]

Applying this test to the test of the hypothesis (1.4.1), we readily see that the
test should be based on

\[ \frac{(Q'\hat{\beta} - c)'[Q'(X'X)^{-1}Q]^{-1}(Q'\hat{\beta} - c)}{\sigma^2} \sim \chi^2_q. \tag{1.5.16} \]

This test can be exactly valid only when $\sigma^2$ is known and hence is analogous to
the standard normal test based on (1.5.2). Thus the F test statistic (1.5.12) may
be regarded as a natural adaptation of (1.5.16) to the situation where $\sigma^2$ must
be estimated. (For a rigorous mathematical discussion of the optimal properties
of the F test, the reader is referred to Scheffé, 1959, p. 46.)

By comparing (1.5.4) and (1.5.12) we immediately note that if q = 1 (and
therefore $Q'$ is a row vector) the F statistic is the square of the t statistic. This
fact clearly indicates that if q = 1 the t test should be used rather than the F test
because a one-tail test is possible only with the t test.
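
The relation between the two statistics for q = 1 can be verified on a small example. In the sketch below the data and the constraint $\beta_1 + \beta_2 = 1$ are assumptions made for illustration; the F statistic is computed from the sums of squared residuals as in (1.5.9) and compared with the square of (1.5.4), which also checks identity (1.5.10) along the way.

```python
from fractions import Fraction as F

X = [[1, 0], [1, 1], [1, 2], [1, 3]]   # made-up data: constant and one regressor
y = [1, 3, 2, 5]
Q, c = [1, 1], 1                        # single constraint beta_1 + beta_2 = 1


def ssr(beta):
    """Sum of squared residuals S(beta) = (y - X beta)'(y - X beta)."""
    return sum((yt - sum(F(x) * b for x, b in zip(row, beta))) ** 2
               for row, yt in zip(X, y))


# Normal equations for K = 2, solved with the explicit 2 x 2 inverse.
a = sum(r[0] * r[0] for r in X)
b = sum(r[0] * r[1] for r in X)
d = sum(r[1] * r[1] for r in X)
g0 = sum(r[0] * t for r, t in zip(X, y))
g1 = sum(r[1] * t for r, t in zip(X, y))
det = F(a * d - b * b)
inv = [[d / det, -b / det], [-b / det, a / det]]          # (X'X)^{-1}
beta_hat = [inv[0][0] * g0 + inv[0][1] * g1, inv[1][0] * g0 + inv[1][1] * g1]

w = [inv[0][0] * Q[0] + inv[0][1] * Q[1], inv[1][0] * Q[0] + inv[1][1] * Q[1]]
qvq = Q[0] * w[0] + Q[1] * w[1]                           # Q'(X'X)^{-1}Q
gap = Q[0] * beta_hat[0] + Q[1] * beta_hat[1] - c         # Q'beta_hat - c
beta_bar = [bh - wi * gap / qvq for bh, wi in zip(beta_hat, w)]   # CLS, (1.4.5)

T, K, q = len(y), 2, 1
sigma2_tilde = ssr(beta_hat) / (T - K)                    # unbiased estimator (1.2.18)
t_squared = gap ** 2 / (sigma2_tilde * qvq)               # square of (1.5.4)
eta = F(T - K, q) * (ssr(beta_bar) - ssr(beta_hat)) / ssr(beta_hat)   # (1.5.9)
print(t_squared, eta)
```

Both quantities equal 32/9 for these data. The sign of `gap` is what the F statistic discards and what a one-tail t test uses.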

As stated earlier, (1.5.1) holds asymptotically even if u is not normal.
Because $\tilde{\sigma}^2 = \hat{u}'\hat{u}/(T - K)$ converges to $\sigma^2$ in probability, the linear hypothesis
can be tested without assuming normality of u by using the fact that $q\eta$ is
asymptotically distributed as $\chi^2_q$. Some people prefer to use the F test in this
situation. The remark in note 6 applies to this practice.

The F statistic $\eta$ given in (1.5.12) takes on a variety of forms as we insert a
variety of specific values into Q and c. As an example, consider the case where
$\beta$ is partitioned as $\beta' = (\beta_1', \beta_2')$, where $\beta_1$ is a $K_1$-vector and $\beta_2$ is a $K_2$-vector
such that $K_1 + K_2 = K$, and the null hypothesis specifies $\beta_2 = \bar{\beta}_2$ and leaves $\beta_1$
unspecified. This hypothesis can be written in the form $Q'\beta = c$ by putting
$Q' = (0, I)$, where 0 is the $K_2 \times K_1$ matrix of zeros and I is the identity matrix
of size $K_2$, and by putting $c = \bar{\beta}_2$. Inserting these values into (1.5.12) yields

\[ \eta = \frac{T - K}{K_2}\cdot\frac{(\hat{\beta}_2 - \bar{\beta}_2)'[(0, I)(X'X)^{-1}(0, I)']^{-1}(\hat{\beta}_2 - \bar{\beta}_2)}{\hat{u}'\hat{u}} \sim F(K_2, T - K). \tag{1.5.17} \]

We can write (1.5.17) in a more suggestive form. Partition X as $X = (X_1, X_2)$
conformably with the partition of $\beta$ and define $M_1 = I - X_1(X_1'X_1)^{-1}X_1'$. Then
by Theorem 13 of Appendix 1 we have

\[ [(0, I)(X'X)^{-1}(0, I)']^{-1} = X_2'M_1X_2. \tag{1.5.18} \]

Therefore, using (1.5.18), we can rewrite (1.5.17) as

\[ \eta = \frac{T - K}{K_2}\cdot\frac{(\hat{\beta}_2 - \bar{\beta}_2)'X_2'M_1X_2(\hat{\beta}_2 - \bar{\beta}_2)}{\hat{u}'\hat{u}} \sim F(K_2, T - K). \tag{1.5.19} \]

Of particular interest is the further special case of (1.5.19), where $K_1 = 1$, so
that $\beta_1$ is a scalar coefficient on the first column of X, which we assume to be
the vector of ones (denoted by $l$), and where $\bar{\beta}_2 = 0$. Then $M_1$ becomes
$L = I - T^{-1}ll'$. Therefore, using (1.2.13), we have

\[ \hat{\beta}_2 = (X_2'LX_2)^{-1}X_2'Ly, \tag{1.5.20} \]

so (1.5.19) can now be written as

\[ \eta = \frac{T - K}{K - 1}\cdot\frac{y'LX_2(X_2'LX_2)^{-1}X_2'Ly}{\hat{u}'\hat{u}} \sim F(K - 1, T - K). \tag{1.5.21} \]

Using the definition of $R^2$ given in (1.2.9), we can rewrite (1.5.21) as

\[ \eta = \frac{T - K}{K - 1}\cdot\frac{R^2}{1 - R^2} \sim F(K - 1, T - K), \tag{1.5.22} \]

since $\hat{u}'\hat{u} = y'Ly - y'LX_2(X_2'LX_2)^{-1}X_2'Ly$ because of Theorem 15 of
Appendix 1. The value of statistic (1.5.22) usually appears in computer printouts and
is commonly interpreted as the test statistic for the hypothesis that the
population counterpart of $R^2$ is equal to 0.
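
Formula (1.5.22) is simple enough to evaluate directly. The helper below is a hypothetical illustration (the function name and the sample values are assumptions); it takes $R^2$, the number of observations T, and the number of regressors K including the constant.

```python
from fractions import Fraction


def overall_f_from_r2(r2, T, K):
    """Statistic (1.5.22): F for 'all slope coefficients are zero' from R^2."""
    r2 = Fraction(r2)
    return Fraction(T - K, K - 1) * r2 / (1 - r2)


# Example: T = 4 observations, a constant and one slope, R^2 = 121/175.
eta = overall_f_from_r2(Fraction(121, 175), T=4, K=2)
print(eta, float(eta))

# The same number follows from (1.5.21) with explained sum of squares 121/20
# and residual sum of squares 27/10.
assert eta == Fraction(4 - 2, 2 - 1) * Fraction(121, 20) / Fraction(27, 10)
```

Under the null hypothesis the value would be referred to the $F(K - 1, T - K)$ distribution; obtaining a critical value requires a table or a statistics library, which is outside this sketch.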


1.5.3 A Test of Structural Change when Variances Are Equal 
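Suppose we have two regression regimes

\[ y_1 = X_1\beta_1 + u_1 \tag{1.5.23} \]

and

\[ y_2 = X_2\beta_2 + u_2, \tag{1.5.24} \]

where the vectors and matrices in (1.5.23) have $T_1$ rows and those in (1.5.24)
$T_2$ rows, $X_1$ is a $T_1 \times K^*$ matrix and $X_2$ is a $T_2 \times K^*$ matrix, and $u_1$ and $u_2$ are
normally distributed with zero means and variance-covariance matrix

\[ E\begin{bmatrix} u_1 \\ u_2 \end{bmatrix}(u_1', u_2') = \begin{bmatrix} \sigma_1^2 I_{T_1} & 0 \\ 0 & \sigma_2^2 I_{T_2} \end{bmatrix}. \]

We assume that both $X_1$ and $X_2$ have rank equal to $K^*$.⁷ We want to test the
null hypothesis $\beta_1 = \beta_2$ assuming $\sigma_1^2 = \sigma_2^2\,(= \sigma^2)$ in the present section and
$\sigma_1^2 \ne \sigma_2^2$ in the next section. This test is especially important in econometric
time series because the econometrician often suspects the occurrence of a
structural change from one era to another (say, from the prewar era to the
postwar era), a change that manifests itself in the regression parameters. When
$\sigma_1^2 = \sigma_2^2$, this test can be handled as a special case of the standard F test
presented in the preceding section.

To apply the F test to the problem, combine Eqs. (1.5.23) and (1.5.24) as

\[ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 & 0 \\ 0 & X_2 \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}. \tag{1.5.25} \]

Then, since $\sigma_1^2 = \sigma_2^2\,(= \sigma^2)$, (1.5.25) is the same as Model 1 with normality;
hence we can represent our hypothesis $\beta_1 = \beta_2$ as a standard linear hypothesis
on Model 1 with normality by putting $T = T_1 + T_2$, $K = 2K^*$, $q = K^*$,
$Q' = (I, -I)$, and $c = 0$. Inserting these values into (1.5.12) yields the test statistic

\[ \eta = \frac{T_1 + T_2 - 2K^*}{K^*}\cdot\frac{(\hat{\beta}_1 - \hat{\beta}_2)'[(X_1'X_1)^{-1} + (X_2'X_2)^{-1}]^{-1}(\hat{\beta}_1 - \hat{\beta}_2)}{y'[I - \underline{X}(\underline{X}'\underline{X})^{-1}\underline{X}']y} \sim F(K^*, T_1 + T_2 - 2K^*), \tag{1.5.26} \]

where $\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'y_1$ and $\hat{\beta}_2 = (X_2'X_2)^{-1}X_2'y_2$, $y = (y_1', y_2')'$, and
$\underline{X}$ denotes the block-diagonal regressor matrix in (1.5.25).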
We shall now give an alternative derivation of (1.5.26). In (1.5.25) we
combined Eqs. (1.5.23) and (1.5.24) without making use of the hypothesis
$\beta_1 = \beta_2$. If we make use of it, we can combine the two equations as

\[ y = X\beta + u, \tag{1.5.27} \]

where we have defined $X = (X_1', X_2')'$ and $\beta = \beta_1 = \beta_2$. Let $S(\hat{\beta})$ be the sum of
squared residuals from (1.5.25), that is,

\[ S(\hat{\beta}) = y'[I - \underline{X}(\underline{X}'\underline{X})^{-1}\underline{X}']y, \tag{1.5.28} \]

where $\underline{X}$ denotes the block-diagonal regressor matrix in (1.5.25),
and let $S(\bar{\beta})$ be the sum of squared residuals from (1.5.27), that is,

\[ S(\bar{\beta}) = y'[I - X(X'X)^{-1}X']y. \tag{1.5.29} \]

Then using (1.5.9) we have⁸

\[ \eta = \frac{T_1 + T_2 - 2K^*}{K^*}\cdot\frac{S(\bar{\beta}) - S(\hat{\beta})}{S(\hat{\beta})} \sim F(K^*, T_1 + T_2 - 2K^*). \tag{1.5.30} \]


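To show the equivalence of (1.5.26) and (1.5.30), we must show

\[ S(\bar{\beta}) - S(\hat{\beta}) = (\hat{\beta}_1 - \hat{\beta}_2)'[(X_1'X_1)^{-1} + (X_2'X_2)^{-1}]^{-1}(\hat{\beta}_1 - \hat{\beta}_2). \tag{1.5.31} \]

From (1.5.29) we have

\[ S(\bar{\beta}) = y'\left[ I - \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}(X_1'X_1 + X_2'X_2)^{-1}(X_1', X_2') \right]y \tag{1.5.32} \]

and from (1.5.28) we have

\[ \begin{aligned}
S(\hat{\beta}) &= y'\left[ I - \begin{bmatrix} X_1(X_1'X_1)^{-1}X_1' & 0 \\ 0 & X_2(X_2'X_2)^{-1}X_2' \end{bmatrix} \right]y \\
  &= y_1'[I - X_1(X_1'X_1)^{-1}X_1']y_1 + y_2'[I - X_2(X_2'X_2)^{-1}X_2']y_2.
\end{aligned} \tag{1.5.33} \]

Therefore, from (1.5.32) and (1.5.33) we get

\[ \begin{aligned}
S(\bar{\beta}) - S(\hat{\beta})
  &= (y_1'X_1, y_2'X_2)
     \begin{bmatrix} (X_1'X_1)^{-1} - (X_1'X_1 + X_2'X_2)^{-1} & -(X_1'X_1 + X_2'X_2)^{-1} \\ -(X_1'X_1 + X_2'X_2)^{-1} & (X_2'X_2)^{-1} - (X_1'X_1 + X_2'X_2)^{-1} \end{bmatrix}
     \begin{bmatrix} X_1'y_1 \\ X_2'y_2 \end{bmatrix} \\
  &= (y_1'X_1, y_2'X_2)
     \begin{bmatrix} (X_1'X_1)^{-1} \\ -(X_2'X_2)^{-1} \end{bmatrix}
     [(X_1'X_1)^{-1} + (X_2'X_2)^{-1}]^{-1}
     [(X_1'X_1)^{-1}, -(X_2'X_2)^{-1}]
     \begin{bmatrix} X_1'y_1 \\ X_2'y_2 \end{bmatrix},
\end{aligned} \tag{1.5.34} \]

the last line of which is equal to the right-hand side of (1.5.31). The last
equality of (1.5.34) follows from Theorem 19 of Appendix 1.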
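
The equivalence of (1.5.26) and (1.5.30) can be illustrated in the simplest setting. The sketch below is hypothetical: it assumes $K^* = 1$ with a single regressor and no intercept in each regime, so that every matrix in the formulas is a scalar, and it uses made-up data.

```python
from fractions import Fraction as F


def through_origin(x, y):
    """Slope, x'x and residual sum of squares for y = x * beta + u."""
    sxx = sum(F(a) * a for a in x)
    sxy = sum(F(a) * b for a, b in zip(x, y))
    syy = sum(F(b) * b for b in y)
    return sxy / sxx, sxx, syy - sxy * sxy / sxx


def chow_statistic(x1, y1, x2, y2):
    """Return the statistic computed as in (1.5.30) and as in (1.5.26)."""
    b1, s1, ssr1 = through_origin(x1, y1)
    b2, s2, ssr2 = through_origin(x2, y2)
    _, _, ssr_pooled = through_origin(x1 + x2, y1 + y2)   # imposes beta_1 = beta_2
    ssr_free = ssr1 + ssr2                                # separate regressions
    dof = len(x1) + len(x2) - 2                           # T1 + T2 - 2K*, K* = 1
    eta_ssr = dof * (ssr_pooled - ssr_free) / ssr_free
    eta_wald = dof * (b1 - b2) ** 2 / (1 / s1 + 1 / s2) / ssr_free
    return eta_ssr, eta_wald


x1, y1 = [1, 2, 3], [1, 2, 4]    # regime 1 (made-up)
x2, y2 = [1, 2, 3], [2, 5, 5]    # regime 2 (made-up)
print(chow_statistic(x1, y1, x2, y2))
```

Both forms give 25/4 for these data, to be compared with the $F(1, 4)$ distribution. The sum-of-squares form is usually the more convenient one because it needs only three regressions.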

The hypothesis $\beta_1 = \beta_2$ is merely one of many linear hypotheses we can
impose on the $\beta$ of the model (1.5.25). For instance, we might want to test the
equality of a subset of $\beta_1$ with the corresponding subset of $\beta_2$. If the subset
consists of the first $K_1^*$ elements of both $\beta_1$ and $\beta_2$, we should put $T = T_1 + T_2$
and $K = 2K^*$ as before, but $q = K_1^*$, $Q' = (I, 0, -I, 0)$, and $c = 0$ in the
formula (1.5.12).

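If, however, we wish to test the equality of a single element of $\beta_1$ with the
corresponding element of $\beta_2$, we should use the t test rather than the F test for
the reason given in Section 1.5.2. Suppose the null hypothesis is $\beta_{1i} = \beta_{2i}$
where these are the ith elements of $\beta_1$ and $\beta_2$, respectively. Let the ith columns
of $X_1$ and $X_2$ be denoted by $x_{1i}$ and $x_{2i}$ and let $X_{1(i)}$ and $X_{2(i)}$ consist of the
remaining $K^* - 1$ columns of $X_1$ and $X_2$, respectively. Define
$M_{1(i)} = I - X_{1(i)}[X_{1(i)}'X_{1(i)}]^{-1}X_{1(i)}'$, $\tilde{x}_{1i} = M_{1(i)}x_{1i}$, and $\tilde{y}_1 = M_{1(i)}y_1$, and similarly define
$M_{2(i)}$, $\tilde{x}_{2i}$, and $\tilde{y}_2$. Then using Eqs. (1.2.12) and (1.2.13), we have

\[ \hat{\beta}_{1i} = \frac{\tilde{x}_{1i}'\tilde{y}_1}{\tilde{x}_{1i}'\tilde{x}_{1i}} \sim N\left( \beta_{1i}, \frac{\sigma_1^2}{\tilde{x}_{1i}'\tilde{x}_{1i}} \right) \tag{1.5.35} \]

and

\[ \hat{\beta}_{2i} = \frac{\tilde{x}_{2i}'\tilde{y}_2}{\tilde{x}_{2i}'\tilde{x}_{2i}} \sim N\left( \beta_{2i}, \frac{\sigma_2^2}{\tilde{x}_{2i}'\tilde{x}_{2i}} \right). \tag{1.5.36} \]

Therefore, under the null hypothesis,

\[ \frac{\hat{\beta}_{1i} - \hat{\beta}_{2i}}{\left( \dfrac{\sigma_1^2}{\tilde{x}_{1i}'\tilde{x}_{1i}} + \dfrac{\sigma_2^2}{\tilde{x}_{2i}'\tilde{x}_{2i}} \right)^{1/2}} \sim N(0, 1). \tag{1.5.37} \]

Also, by Theorem 2 of Appendix 2

\[ \frac{y_1'M_1y_1}{\sigma_1^2} + \frac{y_2'M_2y_2}{\sigma_2^2} \sim \chi^2_{T_1+T_2-2K^*}, \tag{1.5.38} \]

where $M_1 = I - X_1(X_1'X_1)^{-1}X_1'$ and $M_2 = I - X_2(X_2'X_2)^{-1}X_2'$. Because
(1.5.37) and (1.5.38) are independent, we have by Theorem 3 of Appendix 2

\[ \frac{(\hat{\beta}_{1i} - \hat{\beta}_{2i})(T_1 + T_2 - 2K^*)^{1/2}}{\left( \dfrac{\sigma_1^2}{\tilde{x}_{1i}'\tilde{x}_{1i}} + \dfrac{\sigma_2^2}{\tilde{x}_{2i}'\tilde{x}_{2i}} \right)^{1/2}\left( \dfrac{y_1'M_1y_1}{\sigma_1^2} + \dfrac{y_2'M_2y_2}{\sigma_2^2} \right)^{1/2}} \sim S_{T_1+T_2-2K^*}. \tag{1.5.39} \]

Putting $\sigma_1^2 = \sigma_2^2$ in (1.5.39) simplifies it to

\[ \frac{\hat{\beta}_{1i} - \hat{\beta}_{2i}}{\left( \dfrac{\tilde{\sigma}^2}{\tilde{x}_{1i}'\tilde{x}_{1i}} + \dfrac{\tilde{\sigma}^2}{\tilde{x}_{2i}'\tilde{x}_{2i}} \right)^{1/2}} \sim S_{T_1+T_2-2K^*}, \tag{1.5.40} \]

where $\tilde{\sigma}^2$ is the unbiased estimate of $\sigma^2$ obtained from the model (1.5.25),
that is,

\[ \tilde{\sigma}^2 = \frac{1}{T_1 + T_2 - 2K^*}\,y'[I - \underline{X}(\underline{X}'\underline{X})^{-1}\underline{X}']y, \tag{1.5.41} \]

with $\underline{X}$ the block-diagonal regressor matrix in (1.5.25).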


1.5.4 A Test of Structural Change when Variances Are Unequal 


In this section we shall remove the assumption $\sigma_1^2 = \sigma_2^2$ and shall study how to
test the equality of $\beta_1$ and $\beta_2$. The problem is considerably more difficult than
the case considered in the previous section; in fact, there is no definitive
solution to this problem. The difficulty arises because, owing to the
heteroscedasticity of $u$, (1.5.25) is no longer Model 1. Another way to pinpoint
the difficulty is to note that $\sigma_1^2$ and $\sigma_2^2$ do not drop out from the formula
(1.5.39) for the $t$ statistic.

Before proceeding to discuss tests of the equality of $\beta_1$ and $\beta_2$ when $\sigma_1^2 \neq \sigma_2^2$,
we shall first consider a test of the equality of the variances. For, if the
hypothesis $\sigma_1^2 = \sigma_2^2$ is accepted, we can use the $F$ test of the previous section. The null
hypothesis to be tested is $\sigma_1^2 = \sigma_2^2\ (\equiv \sigma^2)$. Under the null hypothesis we have


$$\frac{y_1'M_1y_1}{\sigma^2} \sim \chi^2_{T_1-K^*} \tag{1.5.42}$$
and
$$\frac{y_2'M_2y_2}{\sigma^2} \sim \chi^2_{T_2-K^*}. \tag{1.5.43}$$

Because these two chi-square variables are independent by the assumptions of
the model, we have by Theorem 4 of Appendix 2

$$\frac{T_2 - K^*}{T_1 - K^*}\cdot\frac{y_1'M_1y_1}{y_2'M_2y_2} \sim F(T_1 - K^*,\; T_2 - K^*). \tag{1.5.44}$$

Unlike the $F$ test of Section 1.5.2, a two-tailed test should be used here because
either a large or a small value of (1.5.44) is a reason for rejecting the null
hypothesis.
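
As a small illustration of the statistic (1.5.44), the following sketch computes
the ratio of the two unbiased variance estimates from the residual sums of
squares of the separate regressions. The function name and the numerical inputs
are hypothetical; critical values are not computed and would have to be taken
from $F$ tables, using both a lower and an upper point because the test is
two-tailed.

```python
def variance_ratio_stat(rss1, rss2, t1, t2, k):
    """Return the statistic (1.5.44) and its F degrees of freedom.

    rss1, rss2: residual sums of squares y_1'M_1y_1 and y_2'M_2y_2
    t1, t2:     numbers of observations T_1 and T_2
    k:          number of regressors K* in each regime
    """
    if t1 <= k or t2 <= k:
        raise ValueError("each regime needs more observations than regressors")
    s1 = rss1 / (t1 - k)  # unbiased estimate of sigma_1^2
    s2 = rss2 / (t2 - k)  # unbiased estimate of sigma_2^2
    return s1 / s2, (t1 - k, t2 - k)


# Hypothetical inputs, chosen only to show the arithmetic.
stat, dof = variance_ratio_stat(rss1=30.0, rss2=12.0, t1=23, t2=15, k=3)
print(stat, dof)
```

The returned pair of degrees of freedom identifies which $F$ distribution the
statistic should be compared with under the null hypothesis.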
Regarding the test of the equality of the regression parameters, we shall
consider only the special case considered at the end of Section 1.5.3, namely,
the test of the equality of single elements, $\beta_{1i} = \beta_{2i}$, where the $t$ test is
applicable. The problem is essentially that of testing the equality of two normal means
when the variances are unequal; it is well known among statisticians as the 
Behrens-Fisher problem. Many methods have been proposed and others are 
still being proposed in current journals; yet there is no definitive solution to 
the problem. Kendall and Stuart (1979, Vol. 2, p. 159) have discussed various 
methods of coping with the problem. We shall present one of the methods, 
which is attributable to Welch (1938). 

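As we noted earlier, the difficulty lies in the fact that one cannot derive
(1.5.40) from (1.5.39) unless $\sigma_1^2 = \sigma_2^2$. We shall present a method based on the
assumption that a slight modification of (1.5.40), namely,

$$\xi = \frac{\hat\beta_{1i} - \hat\beta_{2i}}{\left(\dfrac{\tilde\sigma_1^2}{\tilde x_{1i}'\tilde x_{1i}} + \dfrac{\tilde\sigma_2^2}{\tilde x_{2i}'\tilde x_{2i}}\right)^{1/2}}, \tag{1.5.45}$$

where $\tilde\sigma_1^2 = (T_1 - K^*)^{-1}y_1'M_1y_1$ and $\tilde\sigma_2^2 = (T_2 - K^*)^{-1}y_2'M_2y_2$, is
approximately distributed as Student's $t$ with degrees of freedom to be appropriately
determined. Because the statement (1.5.37) is still valid, the assumption that
(1.5.45) is approximately Student's $t$ is equivalent to the assumption that $w$
defined by

$$w = v\,\frac{\dfrac{\tilde\sigma_1^2}{\tilde x_{1i}'\tilde x_{1i}} + \dfrac{\tilde\sigma_2^2}{\tilde x_{2i}'\tilde x_{2i}}}{\dfrac{\sigma_1^2}{\tilde x_{1i}'\tilde x_{1i}} + \dfrac{\sigma_2^2}{\tilde x_{2i}'\tilde x_{2i}}} \tag{1.5.46}$$

is approximately $\chi^2_v$ for some $v$. Because $Ew = v$, $w$ has the same mean as $\chi^2_v$.
We shall determine $v$ so as to satisfy

$$Vw = 2v. \tag{1.5.47}$$

Solving (1.5.47) for $v$, we obtain

$$v = \frac{\left[\dfrac{\sigma_1^2}{\tilde x_{1i}'\tilde x_{1i}} + \dfrac{\sigma_2^2}{\tilde x_{2i}'\tilde x_{2i}}\right]^2}{\dfrac{\sigma_1^4}{(T_1 - K^*)(\tilde x_{1i}'\tilde x_{1i})^2} + \dfrac{\sigma_2^4}{(T_2 - K^*)(\tilde x_{2i}'\tilde x_{2i})^2}}. \tag{1.5.48}$$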


Finally, using the standard normal variable (1.5.37) and the approximate
chi-square variable (1.5.46), we have approximately

$$\xi \sim S_v. \tag{1.5.49}$$

In practice $v$ will be estimated by inserting $\tilde\sigma_1^2$ and $\tilde\sigma_2^2$ into the right-hand side of
(1.5.48) and then choosing the integer closest to the calculated value.
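
The calculation just described can be sketched as follows. The function name,
argument names, and numbers are hypothetical; the inputs are assumed to have
been obtained from the two separate regressions, with `a1` and `a2` standing for
$\tilde x_{1i}'\tilde x_{1i}$ and $\tilde x_{2i}'\tilde x_{2i}$.

```python
def welch_test(b1, b2, s1, s2, a1, a2, t1, t2, k):
    """Welch's statistic (1.5.45) and the estimated degrees of freedom.

    b1, b2: LS estimates of the ith coefficient in each regime
    s1, s2: unbiased variance estimates y_j'M_jy_j / (T_j - K*)
    a1, a2: sums of squares of the partialled-out ith regressor
    t1, t2: sample sizes; k: number of regressors K*
    """
    v1 = s1 / a1  # estimated variance of b1
    v2 = s2 / a2  # estimated variance of b2
    stat = (b1 - b2) / (v1 + v2) ** 0.5
    # (1.5.48) with the unknown variances replaced by their estimates,
    # rounded to the closest integer.
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (t1 - k) + v2 ** 2 / (t2 - k))
    return stat, round(dof)


# Hypothetical inputs with equal estimated variances in the two regimes.
print(welch_test(b1=1.0, b2=0.4, s1=2.0, s2=2.0, a1=10.0, a2=10.0,
                 t1=13, t2=13, k=3))
```

When the two estimated variances and the two sample sizes coincide, the
estimated degrees of freedom equal $T_1 + T_2 - 2K^*$, in agreement with (1.5.40);
unequal variances reduce them.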

Unfortunately, Welch's method does not easily generalize to a situation 
where the equality of vector parameters is involved. Toyoda (1974), like 
Welch, proposed that both the denominator and the numerator chi-square 
variables of (1.5.26) be approximated by the moment method; but the resulting
test statistic is independent of the unknown parameters only under unrealistic
assumptions. Schmidt and Sickles (1977) found Toyoda's approximation
to be rather deficient.

In view of the difficulty encountered in generalizing the Welch method, it 
seems that we should look for other ways to test the equality of the regression 
parameters in the unequal-variances case. There are two obvious methods 
that come to mind: They are (1) the asymptotic likelihood ratio test and (2) the 
asymptotic $F$ test (see Goldfeld and Quandt, 1978).⁹

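The likelihood function of the model defined by (1.5.23) and (1.5.24) is

$$\begin{aligned}
L = {} & (2\pi)^{-(T_1+T_2)/2}\,\sigma_1^{-T_1}\sigma_2^{-T_2} \\
& \times \exp\left[-0.5\,\sigma_1^{-2}(y_1 - X_1\beta_1)'(y_1 - X_1\beta_1)\right] \\
& \times \exp\left[-0.5\,\sigma_2^{-2}(y_2 - X_2\beta_2)'(y_2 - X_2\beta_2)\right].
\end{aligned} \tag{1.5.50}$$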


The value of $L$ attained when it is maximized without constraint, denoted
by $\hat L$, can be obtained by evaluating the parameters of $L$ at $\beta_1 = \hat\beta_1$,
$\beta_2 = \hat\beta_2$, $\sigma_1^2 = \hat\sigma_1^2 \equiv T_1^{-1}(y_1 - X_1\hat\beta_1)'(y_1 - X_1\hat\beta_1)$, and
$\sigma_2^2 = \hat\sigma_2^2 \equiv T_2^{-1}(y_2 - X_2\hat\beta_2)'(y_2 - X_2\hat\beta_2)$. The value of $L$ attained when it is maximized
subject to the constraints $\beta_1 = \beta_2\ (\equiv \beta)$, denoted by $\bar L$, can be obtained by
evaluating the parameters of $L$ at the constrained maximum likelihood
estimates: $\bar\beta_1 = \bar\beta_2\ (\equiv \bar\beta)$, $\bar\sigma_1^2$, and $\bar\sigma_2^2$. These estimates can be iteratively obtained as
follows:
Step 1. Calculate $\bar\beta = (\hat\sigma_1^{-2}X_1'X_1 + \hat\sigma_2^{-2}X_2'X_2)^{-1}(\hat\sigma_1^{-2}X_1'y_1 + \hat\sigma_2^{-2}X_2'y_2)$.
Step 2. Calculate $\bar\sigma_1^2 = T_1^{-1}(y_1 - X_1\bar\beta)'(y_1 - X_1\bar\beta)$ and
$\bar\sigma_2^2 = T_2^{-1}(y_2 - X_2\bar\beta)'(y_2 - X_2\bar\beta)$.
Step 3. Repeat Step 1, substituting $\bar\sigma_1^2$ and $\bar\sigma_2^2$ for $\hat\sigma_1^2$ and $\hat\sigma_2^2$.
Step 4. Repeat Step 2, substituting the estimates of $\beta$ obtained at Step 3
for $\bar\beta$.
Continue this process until the estimates converge. In practice, however, the
estimates obtained at the end of Step 1 and Step 2 may be used without
changing the asymptotic result (1.5.51).

Using the asymptotic theory of the likelihood ratio test, which will be
developed in Section 4.5.1, we have asymptotically (that is, approximately
when both $T_1$ and $T_2$ are large)

$$-2\log(\bar L/\hat L) = T_1\log(\bar\sigma_1^2/\hat\sigma_1^2) + T_2\log(\bar\sigma_2^2/\hat\sigma_2^2) \sim \chi^2_{K^*}. \tag{1.5.51}$$

The null hypothesis $\beta_1 = \beta_2$ is to be rejected when the statistic (1.5.51) is larger
than a certain value.
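
The iteration in Steps 1 through 4 and the statistic (1.5.51) can be sketched for
the simplest case, under the assumption (made here only to avoid matrix
algebra) that each regime has a single regressor, so $K^* = 1$ and every matrix
product is a scalar. The function names, tolerance, and data are hypothetical.

```python
import math


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def mean_sq_resid(y, x, b):
    """T^{-1}(y - xb)'(y - xb) for a single regressor x."""
    return sum((yi - xi * b) ** 2 for yi, xi in zip(y, x)) / len(y)


def lr_test(y1, x1, y2, x2, tol=1e-12, max_iter=500):
    # Unconstrained ML estimates: separate LS fits in each regime.
    b1 = dot(x1, y1) / dot(x1, x1)
    b2 = dot(x2, y2) / dot(x2, x2)
    s1_hat = mean_sq_resid(y1, x1, b1)
    s2_hat = mean_sq_resid(y2, x2, b2)

    # Constrained estimates: alternate Step 1 and Step 2 until beta settles.
    s1, s2, b = s1_hat, s2_hat, b1
    for _ in range(max_iter):
        b_old = b
        b = (dot(x1, y1) / s1 + dot(x2, y2) / s2) / (
            dot(x1, x1) / s1 + dot(x2, x2) / s2)   # Step 1
        s1 = mean_sq_resid(y1, x1, b)              # Step 2
        s2 = mean_sq_resid(y2, x2, b)
        if abs(b - b_old) < tol:
            break

    # Statistic (1.5.51); asymptotically chi-square with K* = 1 d.f. here.
    lr = len(y1) * math.log(s1 / s1_hat) + len(y2) * math.log(s2 / s2_hat)
    return b, lr


# Hypothetical data: slope near 1 in regime 1, near 2 in regime 2.
x = [1.0, 2.0, 3.0, 4.0]
print(lr_test([1.1, 1.9, 3.2, 3.9], x, [2.3, 3.8, 6.1, 8.2], x))
```

The constrained estimate of $\beta$ is a weighted average of the two separate LS
estimates, with weights that change at each pass as the variance estimates are
updated.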

The asymptotic $F$ test is derived by the following simple procedure: First,
estimate $\sigma_1^2$ and $\sigma_2^2$ by $\hat\sigma_1^2$ and $\hat\sigma_2^2$, respectively, and define $\hat\rho = \hat\sigma_1/\hat\sigma_2$. Second,
multiply both sides of (1.5.24) by $\hat\rho$ and define the new equation

$$y_2^* = X_2^*\beta_2 + u_2^*, \tag{1.5.52}$$

where $y_2^* = \hat\rho y_2$, $X_2^* = \hat\rho X_2$, and $u_2^* = \hat\rho u_2$. Third, treat (1.5.23) and (1.5.52) as
the given equations and perform the $F$ test (1.5.26) on them. The method
works asymptotically because the variance of $u_2^*$ is approximately the same as
that of $u_1$ when $T_1$ and $T_2$ are large, because $\hat\rho$ converges to $\sigma_1/\sigma_2$ in probability.
Goldfeld and Quandt (1978) conducted a Monte Carlo experiment that
showed that, when $\sigma_1^2 \neq \sigma_2^2$, the asymptotic $F$ test performs well, closely
followed by the asymptotic likelihood ratio test, whereas the $F$ test based on
the assumption of equality of the variances could be considerably inferior.
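
The three steps of the asymptotic $F$ test can be sketched in the same
single-regressor setting ($K^* = 1$, an assumption of this illustration). The
sketch also assumes that the $F$ test (1.5.26) is the usual comparison of the
restricted and unrestricted residual sums of squares, as in (1.5.32) and
(1.5.33); function names and data are hypothetical.

```python
def ls_slope(y, x):
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def rss(y, x, b):
    return sum((yi - xi * b) ** 2 for yi, xi in zip(y, x))


def asymptotic_f(y1, x1, y2, x2):
    t1, t2, k = len(y1), len(y2), 1
    b1, b2 = ls_slope(y1, x1), ls_slope(y2, x2)
    # First: rho_hat = sigma_1_hat / sigma_2_hat from the separate fits.
    rho = ((rss(y1, x1, b1) / t1) / (rss(y2, x2, b2) / t2)) ** 0.5
    # Second: rescale the second regime, which gives (1.5.52).
    y2s = [rho * v for v in y2]
    x2s = [rho * v for v in x2]
    # Third: F test of equal slopes on the first and the rescaled equations.
    unrestricted = rss(y1, x1, b1) + rss(y2s, x2s, ls_slope(y2s, x2s))
    y_all, x_all = y1 + y2s, x1 + x2s
    restricted = rss(y_all, x_all, ls_slope(y_all, x_all))
    dof = t1 + t2 - 2 * k
    return (dof / k) * (restricted - unrestricted) / unrestricted, (k, dof)


# Hypothetical data with different slopes in the two regimes.
x = [1.0, 2.0, 3.0, 4.0]
print(asymptotic_f([1.1, 1.9, 3.2, 3.9], x, [2.3, 3.8, 6.1, 8.2], x))
```

Rescaling the second equation leaves its LS slope unchanged but puts its
residuals on roughly the same scale as those of the first equation, which is
what the pooled regression requires.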


1.6 Prediction 
We shall add to Model 1 the $p$th period relationship (where $p > T$)

$$y_p = x_p'\beta + u_p, \tag{1.6.1}$$

where $y_p$ and $u_p$ are scalars and $x_p$ are the $p$th period observations on the
regressors that we assume to be random variables distributed independently of
$u_p$ and $u$.¹⁰ We shall also assume that $u_p$ is distributed independently of $u$ with
$Eu_p = 0$ and $Vu_p = \sigma^2$. The problem we shall consider in this section is how to
predict $y_p$ by a function of $y$, $X$, and $x_p$ when $\beta$ and $\sigma^2$ are unknown.

We shall only consider predictors of $y_p$ that can be written in the following
form:

$$y_p^* = x_p'\beta^*, \tag{1.6.2}$$

where $\beta^*$ is an arbitrary estimator of $\beta$ and a function of $y$ and $X$. Here, $\beta^*$ may
be either linear or nonlinear and unbiased or biased. Although there are more
general predictors of the form $f(x_p, y, X)$, it is natural to consider (1.6.2)
because $x_p'\beta$ is the best predictor of $y_p$ if $\beta$ is known.

The mean squared prediction error of $y_p^*$ conditional on $x_p$ is given by

$$E[(y_p^* - y_p)^2|x_p] = \sigma^2 + x_p'E(\beta^* - \beta)(\beta^* - \beta)'x_p, \tag{1.6.3}$$

where the equality follows from the assumption of independence between $\beta^*$
and $u_p$. Equation (1.6.3) clearly demonstrates that as long as we consider only
predictors of the form (1.6.2) and as long as we use the mean squared prediction
error as a criterion for ranking predictors, the prediction of $y_p$ is essentially
the same problem as the estimation of $x_p'\beta$. Thus the better the estimator of $x_p'\beta$
is, the better the predictor of $y_p$.

In particular, the result of Section 1.2.5 implies that $x_p'\hat\beta$, where $\hat\beta$ is the LS
estimator, is the best predictor in the class of predictors of the form $x_p'C'y$ such
that $C'X = I$, which we shall state as the following theorem:


THEOREM 1.6.1. Let $\hat\beta$ be the LS estimator and $C$ be an arbitrary $T \times K$
matrix of constants such that $C'X = I$. Then

$$E[(x_p'\hat\beta - y_p)^2|x_p] \leq E[(x_p'C'y - y_p)^2|x_p], \tag{1.6.4}$$

where the equality holds if and only if $C = X(X'X)^{-1}$.


Actually we can prove a slightly stronger theorem, which states that the least
squares predictor $x_p'\hat\beta$ is the best linear unbiased predictor.


THEOREM 1.6.2. Let $d$ be a $T$-vector the elements of which are either
constants or functions of $x_p$. Then

$$E[(d'y - y_p)^2|x_p] \geq E[(x_p'\hat\beta - y_p)^2|x_p]$$

for any $d$ such that $E(d'y|x_p) = E(y_p|x_p)$. The equality holds if and only if
$d' = x_p'(X'X)^{-1}X'$.


Proof. The unbiasedness condition $E(d'y|x_p) = E(y_p|x_p)$ implies

$$d'X = x_p'. \tag{1.6.5}$$

Using (1.6.5), we have

$$\begin{aligned}
E[(d'y - y_p)^2|x_p] &= E[(d'u - u_p)^2|x_p] \\
&= \sigma^2(1 + d'd).
\end{aligned} \tag{1.6.6}$$

But from (1.6.3) we obtain

$$E[(x_p'\hat\beta - y_p)^2|x_p] = \sigma^2[1 + x_p'(X'X)^{-1}x_p]. \tag{1.6.7}$$

Therefore the theorem follows from

$$d'd - x_p'(X'X)^{-1}x_p = [d - X(X'X)^{-1}x_p]'[d - X(X'X)^{-1}x_p], \tag{1.6.8}$$

where we have used (1.6.5) again.


Theorem 1.6.2 implies Theorem 1.6.1 because $Cx_p$ of Theorem 1.6.1 satisfies
the condition for $d$ given by (1.6.5).¹¹
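
The identity (1.6.8) can be checked numerically in a small case. The sketch
below assumes a single regressor ($K = 1$) and uses invented numbers; it compares
the least squares weights $d = X(X'X)^{-1}x_p$ with another weight vector that
also satisfies (1.6.5).

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


xs = [1.0, 2.0, 3.0]   # hypothetical single regressor (K = 1, T = 3)
xp = 2.0               # hypothetical prediction-period value
sxx = dot(xs, xs)

d_ls = [x * xp / sxx for x in xs]          # d = X(X'X)^{-1}x_p
z = [0.75, 0.0, -0.25]                     # chosen so that z'X = 0
d_alt = [a + b for a, b in zip(d_ls, z)]   # another d with d'X = x_p'

# By (1.6.8) the excess of d'd over x_p'(X'X)^{-1}x_p equals the squared
# length of d - X(X'X)^{-1}x_p, which here is z'z.
excess = dot(d_alt, d_alt) - dot(d_ls, d_ls)
print(dot(d_alt, xs), excess, dot(z, z))
```

By (1.6.6), multiplying the excess by $\sigma^2$ gives the amount by which the
alternative linear unbiased predictor's mean squared prediction error exceeds
that of the least squares predictor.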

In Section 1.2.4 we stated that if we cannot choose between a pair of 
estimators by the criterion defined by Definition 1.2.2 (that is, if the difference 
of the mean squared error matrices is neither nonnegative definite nor non- 
positive definite), we can use the trace or the determinant as a criterion. The 
conditional mean squared prediction error defined in (1.6.3) provides an 
alternative scalar criterion, which may have a more intuitive appeal than the 
trace or the determinant because it is directly related to an important purpose 
to which estimation is put— namely, prediction. However, it has one serious 
weakness: At the time when the choice of estimators is made, $x_p$ is usually not
observed. 

A solution to the dilemma is to assume a certain distribution for the random
variables $x_p$ and take the further expectation of (1.6.3) with respect to that
distribution. Following Amemiya (1966), let us assume

$$Ex_px_p' = T^{-1}X'X. \tag{1.6.9}$$

Then we obtain from (1.6.3) the unconditional mean squared prediction error

$$E(y_p^* - y_p)^2 = \sigma^2 + T^{-1}E(\beta^* - \beta)'X'X(\beta^* - \beta). \tag{1.6.10}$$

This provides a workable and intuitively appealing criterion for choosing an
estimator. The use of this criterion in choosing models will be discussed in
Section 2.1.5. The unconditional mean squared prediction error of the least
squares predictor $x_p'\hat\beta$ is given by

$$E(x_p'\hat\beta - y_p)^2 = \sigma^2(1 + T^{-1}K). \tag{1.6.11}$$
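
The step from (1.6.7) to (1.6.11) can be illustrated numerically. The sketch
below assumes $K = 2$ with regressors $(1, x_t)$ and lets $x_p$ take each sample
value with probability $1/T$, which is one simple distribution satisfying (1.6.9)
exactly. The data, $\sigma^2$, and function names are hypothetical.

```python
def quad_form_inv_2x2(a, b, c, v):
    """v'[[a, b], [b, c]]^{-1}v for a symmetric 2x2 matrix."""
    det = a * c - b * b
    v0, v1 = v
    return (c * v0 * v0 - 2 * b * v0 * v1 + a * v1 * v1) / det


def conditional_mspe(sigma2, xs, xp):
    """Right-hand side of (1.6.7) for the regressors (1, x_t)."""
    t = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    return sigma2 * (1 + quad_form_inv_2x2(t, sx, sxx, (1.0, xp)))


xs = [0.5, 1.0, 2.0, 3.5, 4.0]   # hypothetical sample regressor values
sigma2 = 2.0                     # hypothetical error variance

# Average (1.6.7) over the sample values of x, then compare with (1.6.11).
avg = sum(conditional_mspe(sigma2, xs, xp) for xp in xs) / len(xs)
print(avg, sigma2 * (1 + 2 / len(xs)))
```

The two printed numbers agree because the average of $x_t'(X'X)^{-1}x_t$ over the
sample equals $T^{-1}\,\mathrm{tr}[(X'X)^{-1}X'X] = K/T$.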


Exercises 


1. (Section 1.1.2) 
Give an example of a pair of random variables that are noncorrelated but 
not independent. 


2. (Section 1.1.3)
Let $y$ and $x$ be scalar dichotomous random variables with zero means.
Define $u = y - \mathrm{Cov}(y, x)(Vx)^{-1}x$. Prove $E(u|x) = 0$. Are $u$ and $x$
independent?


3. (Section 1.1.3)
Let $y$ be a scalar random variable and $x$ be a vector of random variables.
Prove $E[y - E(y|x)]^2 \leq E[y - g(x)]^2$ for any function $g$.


4. (Section 1.1.3)
A fair die is rolled. Let $y$ be the face number showing and define $x$ by the
rule:

$$x = \begin{cases} y & \text{if } y \text{ is even} \\ 0 & \text{if } y \text{ is odd.} \end{cases}$$

Find the best predictor and the best linear predictor of $y$ based on $x$ and
compare the mean squared prediction errors.


5. (Section 1.2.1)
Assume that $y$ is $3 \times 1$ and $X = (X_1, X_2)$ is $3 \times 2$ and draw a
three-dimensional analog of Figure 1.1.


6. (Section 1.2.5)
Prove that the class of linear unbiased estimators is equivalent to the class
of instrumental variables estimators.


7. (Section 1.2.5)
In Model 1 find a member of the class of linear unbiased estimators for
which the trace of the mean squared error matrix is the smallest, by
minimizing $\mathrm{tr}\, C'C$ subject to the constraint $C'X = I$.


8. (Section 1.2.5)
Prove that $\hat\beta_1$ defined by (1.2.14) is a best linear unbiased estimator of $\beta_1$.


9. (Section 1.2.5)
In Model 1 further assume $K = 1$ and $X = l$, where $l$ is the vector of ones.
Define $\beta^+ = l'y/(T + 1)$, obtain its mean squared error, and compare it
with that of the least squares estimator $\hat\beta$.


10. (Section 1.2.5)
In Model 1 further assume that $T = 3$, $K = 1$, $X = (1, 1, 1)'$, and that
$\{u_t\}$, $t = 1, 2, 3$, are independent with the distribution

$$u_t = \begin{cases} \sigma & \text{with probability } \tfrac{1}{2} \\ -\sigma & \text{with probability } \tfrac{1}{2}. \end{cases}$$

Obtain the mean squared error of $\hat\beta_R = y'y/x'y$ and compare it with that
of the least squares estimator $\hat\beta$. (Note that $\hat\beta_R$, the reverse least squares
estimator, is obtained by minimizing the sum of squared errors in the
direction of the $x$-axis.)


11. (Section 1.3.2)
Assume $K = 1$ in Model 1 with normality and furthermore assume
$\beta^2 = \sigma^2$. Obtain the maximum likelihood estimator of $\beta$ and obtain the
Cramér-Rao lower bound.


12. (Section 1.4.2)
Suppose

$$Q' = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{pmatrix}.$$

Find a row vector $R'$ such that $(Q, R)$ is nonsingular and $R'Q = 0$.


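13. (Section 1.4.2)
Somebody has run a least squares regression in the classical regression
model (Model 1) and reported

$$\hat\beta = \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \\ \hat\beta_3 \end{pmatrix} = \begin{pmatrix} 5 \\ -4 \\ 2 \end{pmatrix} \quad\text{and}\quad \hat\sigma^2(X'X)^{-1} = \begin{pmatrix} 3 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}.$$

On the basis of this information, how would you estimate $\beta$ if you
believed $\beta_1 + \beta_2 = \beta_3$?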


14. (Section 1.5.3)
We have $T$ observations on the dependent variable $y_t$ and the independent
variables $x_t$ and $z_t$, $t = 1, 2, \ldots, T$. We believe a structural change
occurred from the first period consisting of $T_1$ observations to the second
period consisting of $T_2$ observations ($T_1 + T_2 = T$) in such a way that in
the first period $Ey_t$ depends linearly on $x_t$ (with an intercept) but not on $z_t$,
whereas in the second period $Ey_t$ depends linearly on $z_t$ (with an intercept)
but not on $x_t$. How do you test this hypothesis? You may assume that $\{y_t\}$
are independent with constant variance $\sigma^2$ for $t = 1, 2, \ldots, T$.


15. (Section 1.5.3)
Consider the following two regression equations, each of which satisfies
the assumptions of the classical regression model (Model 1):

(1) $y_1 = \alpha_1 l + \alpha_2 x_1 + u_1$
(2) $y_2 = \beta_1 l + \beta_2 x_2 + u_2$,

where $\alpha$'s and $\beta$'s are scalar parameters, $y_1$, $y_2$, $x_1$, $x_2$, $u_1$, and $u_2$ are
seven-component vectors, and $l$ is the seven-component vector of ones.
Assume that $u_1$ and $u_2$ are normally distributed with the common variance
$\sigma^2$ and that $u_1$ and $u_2$ are independent of each other. Suppose that
$l'x_1 = l'x_2 = 0$ and $l'y_1 = l'y_2 = 7$. Suppose also that the sample moment
matrix of the four observable variables is given as

        y_1    y_2    x_1    x_2
y_1     9.3    7      2      1.5
y_2     7      9      3      1
x_1     2      3      2      1.2
x_2     1.5    1      1.2    1

For example, the table shows $y_1'y_1 = 9.3$ and $y_1'y_2 = 7$. Should you reject
the joint hypothesis "$\alpha_1 = \beta_1$ and $\alpha_1 + 2\alpha_2 = \beta_2$" at the 5% significance
level? How about at 1%?


16. (Section 1.5.3)
Consider a classical regression model

$y_1 = \alpha_1 x_1 + \beta_1 z_1 + u_1$
$y_2 = \alpha_2 x_2 + \beta_2 z_2 + u_2$
$y_3 = \alpha_3 x_3 + \beta_3 z_3 + u_3$,

where $\alpha$'s and $\beta$'s are scalar unknown parameters, the other variables are
vectors of ten elements, $x$'s and $z$'s are vectors of known constants, and $u$'s
are normal with mean zero and $Eu_iu_i' = \sigma^2 I$ for every $i$ and $Eu_iu_j' = 0$ if
$i \neq j$. Suppose the observed vector products are as follows:


yi 72, yix 7. yi» 72 
yi; 7l yox,=3, yix =2 
yiz 72, yz 73, yin =l 
$x_i'x_i = z_i'z_i = 4$ for every $i$
$x_i'z_i = 0$ for every $i$.


Test the joint hypothesis (a, = a, and f, = f) at the 5% significance 
level. 
17. (Section 1.6)
Consider Model 1 with the added prediction-period equation (1.6.1).
Suppose $Z$ is a $T \times L$ matrix of known constants and $z_p$ is an $L$-vector of
known constants. Which of the following predictors of $y_p$ is better? Explain.

$\hat y_p = z_p'(Z'Z)^{-1}Z'y$
$\tilde y_p = z_p'(Z'Z)^{-1}Z'X(X'X)^{-1}X'y$.


18. (Section 1.6)
Consider the case $K = 2$ in Model 1, where $y_t = \beta_1 + \beta_2x_t + u_t$. For the
prediction period we have $y_p = \beta_1 + \beta_2x_p + u_p$, where $u_p$ satisfies the
assumptions of Section 1.6. Obtain the mean squared prediction error of the
predictor $\tilde y_p = T^{-1}\sum_{t=1}^{T} y_t$ and compare it with the mean squared
prediction error of the least squares predictor.


19. (Section 1.6)
Prove that any $d$ satisfying (1.6.5) can be written as $Cx_p$ for some $C$ such
that $C'X = I$.


2 Recent Developments 
in Regression Analysis 


In this chapter we shall present three additional topics. They can be discussed 
in the framework of Model 1 but are grouped here in a separate chapter 
because they involve developments more recent than the results of the pre- 
vious chapter. 


2.1 Selection of Regressors¹
2.1.1 Introduction 


Most of the discussion in Chapter 1 proceeded on the assumption that a given 
model (Model 1 with or without normality) is correct. This is the ideal situa- 
tion that would occur if econometricians could unambiguously answer ques- 
tions such as which independent variables should be included in the right- 
hand side of the regression equation; which transformation, if any, should be 
applied to the independent variables; and what assumptions should be im- 
posed on the distribution of the error terms on the basis of economic theory. In 
practice this ideal situation seldom occurs, and some aspects of the model 
specification remain doubtful in the minds of econometricians. Then they 
must not only estimate the parameters of a given model but also choose a 
model among many models. 

We have already considered a particular type of the problem of model 
selection, namely, the problem of choosing between Model 1 without
constraint and Model 1 with the linear constraints $Q'\beta = c$. The model selection
of this type (selection among “nested” models), where one model is a special
case of the other broader model, is the easiest to handle because the standard
technique of hypothesis testing is precisely geared for handling this problem.
We shall encounter many instances of this type of problem in later chapters. In
Section 2.1, however, we face the more unorthodox problem of choosing
between models or hypotheses, neither of which is contained in the other
(“nonnested” models).

In particular, we shall study how to choose between the two competing
regression equations

$$y = X_1\beta_1 + u_1 \tag{2.1.1}$$
and
$$y = X_2\beta_2 + u_2, \tag{2.1.2}$$

where $X_1$ is a $T \times K_1$ matrix of constants, $X_2$ is a $T \times K_2$ matrix of constants,
$Eu_1 = Eu_2 = 0$, $Eu_1u_1' = \sigma_1^2 I$, and $Eu_2u_2' = \sigma_2^2 I$. Note that this notation differs
from that of Chapter 1 in that here there is no explicit connection between $X_1$
and $X_2$: $X_1$ and $X_2$ may contain some common column vectors or they may
not; one of $X_1$ and $X_2$ may be completely contained in the other or they may be
completely different matrices. Note also that the dependent variable $y$ is the
same for both equations. (Selection among more general models will be
discussed briefly in Section 4.5.)

This problem is quite common in econometrics, for econometricians often 
run several regressions, each of which purports to explain the same dependent 
variable, and then they choose the equation which satisfies them most accord- 
ing to some criterion. The choice is carried out through an intuitive and 
unsystematic thought process, as the analysts consider diverse factors such as a 
goodness of fit, reasonableness of the sign and magnitude of an estimated 
coefficient, and the value of a t statistic on each regression coefficient. Among 
these considerations, the degree of fit normally plays an important role, al- 
though the others should certainly not be ignored. Therefore, in the present 
study we shall focus our attention on the problem of finding an appropriate 
measure of the degree of fit. The multiple correlation coefficient $R^2$, defined in
(1.2.8), has an intuitive appeal and is a useful descriptive statistic; however, it
has one obvious weakness, namely, that it attains its maximum of unity when
one uses as many independent variables as there are observations (that is,
when $K = T$). Much of what we do here may be regarded as a way to rectify
that weakness by modifying $R^2$.


2.1.2 Statistical Decision Theory 


We shall briefly explain the terminology used in statistical decision theory. For 
a more thorough treatment of the subject, the reader should consult Zacks 
(1971). Statistical decision theory is a branch of game theory that analyzes the
game played by statisticians against nature. The goal of the game for
statisticians is to make a guess at the value of a parameter (chosen by nature) on the
basis of the observed sample, and their gain from the game is a function of how
close their guess is to the true value. The major components of the game are $\Theta$,
the parameter space; $Y$, the sample space; and $D$, the decision space (the
totality of functions from $Y$ to $\Theta$). We shall denote a single element of each
space by the lowercase letters $\theta$, $y$, $d$. Thus, if $y$ is a particular observed sample
(a vector of random variables), $d$ is a function of $y$ (called a statistic or an
estimator) used to estimate $\theta$. We assume that the loss incurred by choosing $d$
when the true value of the parameter is $\theta$ is given by the loss function $L(d, \theta)$.
We shall define a few standard terms used in statistical decision theory. 


Risk. The expected loss $E_yL(d, \theta)$ for which the expectation is taken with
respect to $y$ (which is implicitly in the argument of the function $d$) is called the
risk and is denoted by $R(d|\theta)$.

Uniformly smaller risk. The estimator² $d_1$ has a uniformly smaller risk than
the estimator $d_2$ if $R(d_1|\theta) \leq R(d_2|\theta)$ for all $\theta \in \Theta$ and $R(d_1|\theta) < R(d_2|\theta)$ for at
least one $\theta \in \Theta$.

Admissible. An estimator is admissible if there is no d in D that has a 
uniformly smaller risk. Otherwise it is called inadmissible. 

Minimax. The estimator $d^*$ is called a minimax estimator if

$$\max_{\theta\in\Theta} R(d^*|\theta) = \min_{d\in D}\,\max_{\theta\in\Theta} R(d|\theta).$$

The minimax estimator protects the statistician against the worst possible
situation. If $\max_{\theta\in\Theta} R(d|\theta)$ does not exist, it should be replaced with
$\sup_{\theta\in\Theta} R(d|\theta)$ in the preceding definition (and min with inf).

Posterior risk. The expected loss $E_\theta L(d, \theta)$ for which the expectation is
taken with respect to the posterior distribution of $\theta$ given $y$ is called the
posterior risk and is denoted by $R(d|y)$. It obviously depends on the particular
prior distribution used in obtaining the posterior distribution.

Bayes estimator. The Bayes estimator, given a particular prior distribution,
minimizes the posterior risk $R(d|y)$. If the loss function is quadratic, namely,
$L(d, \theta) = (d - \theta)'W(d - \theta)$ where $W$ is an arbitrary nonsingular matrix, the
posterior risk $E_\theta(d - \theta)'W(d - \theta)$ is minimized at $d = E_\theta\theta$, the posterior
mean of $\theta$. An example of the Bayes estimator was given in Section 1.4.4.

Regret. Let $R(d|\theta)$ be the risk. Then the regret $W(d|\theta)$ is defined by

$$W(d|\theta) = R(d|\theta) - \min_{d\in D} R(d|\theta).$$

Minimax regret. The minimax regret strategy minimizes $\max_{\theta\in\Theta} W(d|\theta)$
with respect to $d$.

Some useful results can be stated informally as remarks rather than stating
them formally as theorems.


REMARK 2.1.1. A Bayes estimator is admissible. 


REMARK 2.1.2. A minimax estimator is either a Bayes estimator or the limit 
of a sequence of Bayes estimators. The latter is called a generalized Bayes 
estimator. (In contrast, a Bayes estimator is sometimes called a proper Bayes 
estimator.) 


REMARK 2.1.3. A generalized Bayes estimator with a constant risk is 
minimax. 


REMARK 2.1.4. An admissible estimator may or may not be minimax, and a
minimax estimator may or may not be admissible.
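
The definitions of Section 2.1.2 can be illustrated with a toy example in which
both the set of estimators and the parameter space are finite, so that every risk
can be written in a small table. The risk values below are invented for
illustration and do not come from any particular model.

```python
# Hypothetical risks R(d | theta) for three estimators (rows) at three
# parameter values (columns); the numbers are invented for illustration.
RISK = {
    "d1": [1.0, 4.0, 2.0],
    "d2": [3.0, 3.5, 3.0],
    "d3": [2.0, 5.0, 3.0],
}


def uniformly_smaller(a, b):
    """True if risk vector a is uniformly smaller than risk vector b."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def is_admissible(name):
    return not any(
        uniformly_smaller(RISK[other], RISK[name])
        for other in RISK if other != name
    )


def regret(name):
    n = len(RISK[name])
    best = [min(r[j] for r in RISK.values()) for j in range(n)]
    return [r - b for r, b in zip(RISK[name], best)]


minimax = min(RISK, key=lambda d: max(RISK[d]))
minimax_regret = min(RISK, key=lambda d: max(regret(d)))
admissible = [d for d in RISK if is_admissible(d)]
print(minimax, minimax_regret, admissible)
```

With these invented numbers, d3 is inadmissible because d1 has a uniformly
smaller risk, d2 is the minimax choice, and d1 minimizes the maximum regret, so
the admissible estimator d1 is not minimax, in line with Remark 2.1.4.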


2.1.3. Bayesian Solution 


The Bayesian solution to the selection-of-regressors problem provides a peda- 
gogically useful starting point although it does not necessarily lead to a useful 
solution in practice. We can obtain the Bayesian solution as a special case of 
the Bayes estimator (defined in Section 2.1.2) for which both $\Theta$ and $D$ consist
of two elements. Let the losses be represented as shown in Table 2.1, where $L_{12}$
is the loss incurred by choosing model 1 when model 2 is the true model
and $L_{21}$ is the loss incurred by choosing model 2 when model 1 is the true
model.³ Then, by the result of Section 2.1.2, the Bayesian strategy is to choose
model 1 if

$$L_{12}P(2|y) < L_{21}P(1|y), \tag{2.1.3}$$

where $P(i|y)$, $i = 1$ and 2, is the posterior probability that the model $i$ is true
given the sample $y$. The posterior probabilities are obtained by Bayes's rule as

$$P(1|y) = \frac{\int f(y|\theta_1)f(\theta_1|1)P(1)\,d\theta_1}{\int f(y|\theta_1)f(\theta_1|1)P(1)\,d\theta_1 + \int f(y|\theta_2)f(\theta_2|2)P(2)\,d\theta_2} \tag{2.1.4}$$

and similarly for $P(2|y)$, where $\theta_i = (\beta_i', \sigma_i^2)'$, $f(y|\theta_i)$ is the joint density of $y$
given $\theta_i$, $f(\theta_i|i)$ is the prior density of $\theta_i$ given the model $i$, and $P(i)$ is the prior
probability that the model $i$ is true, for $i = 1$ and 2.

There is an alternative way to characterize the Bayesian strategy. Let $S$ be a
subset of the space of $y$ such that the Bayesian chooses the model 1 if $y \in S$.

Table 2.1 The loss matrix

                      True model
Decision              Model 1     Model 2
Choose model 1        0           $L_{12}$
Choose model 2        $L_{21}$    0

Then the Bayesian minimizes the posterior risk

$$L_{12}P(2)P(y \in S|2) + L_{21}P(1)P(y \in \bar S|1) \tag{2.1.5}$$

with respect to $S$, where $\bar S$ is the complement of $S$. It is easy to show that the
posterior risk (2.1.5) is minimized when $S$ is chosen to be the set of $y$ that
satisfies the inequality (2.1.3).
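
The decision rule (2.1.3) and the posterior probability (2.1.4) can be sketched as
follows. The integrated likelihoods `m1` and `m2`, which stand for
$\int f(y|\theta_i)f(\theta_i|i)\,d\theta_i$, are taken as given inputs; computing them is
the substantive part of the Bayesian solution and is not attempted here. All
names and numbers are hypothetical.

```python
def posterior_prob_model1(m1, m2, prior1):
    """P(1|y) from (2.1.4), given the two integrated likelihoods.

    m1, m2: integrals of f(y|theta_i) f(theta_i|i) over theta_i
    prior1: prior probability P(1); P(2) is taken to be 1 - prior1
    """
    num = m1 * prior1
    return num / (num + m2 * (1.0 - prior1))


def choose_model(l12, l21, post1):
    """Rule (2.1.3): choose model 1 if L12 P(2|y) < L21 P(1|y)."""
    return 1 if l12 * (1.0 - post1) < l21 * post1 else 2


# Hypothetical integrated likelihoods and equal prior probabilities.
post1 = posterior_prob_model1(m1=0.2, m2=0.1, prior1=0.5)
print(post1, choose_model(1.0, 1.0, post1), choose_model(5.0, 1.0, post1))
```

With symmetric losses the rule picks the model with the larger posterior
probability; raising $L_{12}$ makes a mistaken choice of model 1 more costly and
can reverse the decision.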

The actual Bayesian solution is obtained by specifying $f(y|\theta_i)$, $f(\theta_i|i)$, and
$P(i)$ in (2.1.4) and the corresponding expressions for $P(2|y)$. This is not done
here because our main purpose is to understand the basic Bayesian thought, in
light of which we can perhaps more clearly understand some of the classical
strategies to be discussed in subsequent subsections. The interested reader
should consult Gaver and Geisel (1974) or Zellner (1971, p. 306). Gaver and
Geisel pointed out that if we use the standard specifications, that is, $f(y|\theta_i)$
normal, $f(\theta_i|i)$ "diffuse" natural conjugate, $P(1) = P(2)$, the Bayesian
solution leads to a meaningless result unless $K_1 = K_2$.⁴


2.1.4 Theil's Corrected $\bar R^2$


Theil (1961, p. 213) proposed a correction of $R^2$ aimed at eliminating the
aforementioned weakness of $R^2$. Theil's corrected $R^2$, denoted by $\bar R^2$, is
defined by

$$1 - \bar R^2 = \frac{T}{T - K}(1 - R^2). \tag{2.1.6}$$


Because we have from (1.2.9)

$$1 - R^2 = \frac{y'My}{y'Ly}, \tag{2.1.7}$$

where $M = I - X(X'X)^{-1}X'$ and $L = I - T^{-1}ll'$ as before, and because $y'Ly$ is
the same for every equation with the same dependent variable, choosing the
equation with the largest $\bar R^2$ is equivalent to choosing the equation with the
smallest $\tilde\sigma^2 = (T - K)^{-1}y'My$. Coming back to the choice between Eqs.
(2.1.1) and (2.1.2), Theil's strategy based on his $\bar R^2$ amounts to choosing Eq.
(2.1.1) if

$$\tilde\sigma_1^2 < \tilde\sigma_2^2, \tag{2.1.8}$$

where $\tilde\sigma_1^2 = (T - K_1)^{-1}y'M_1y$,
$M_1 = I - X_1(X_1'X_1)^{-1}X_1'$,
$\tilde\sigma_2^2 = (T - K_2)^{-1}y'M_2y$, and
$M_2 = I - X_2(X_2'X_2)^{-1}X_2'$.


The inequality (2.1.8) can be regarded as a constraint on $y$ and hence defines
a subset in the space of $y$. Call it $S_0$; that is, $S_0 = \{y \mid \tilde\sigma_1^2 < \tilde\sigma_2^2\}$. This choice of $S$
can be evaluated in terms of the Bayesian minimand (2.1.5). Suppose Eq.
(2.1.2) is the true model. Then we have

$$\begin{aligned}
\tilde\sigma_1^2 - \tilde\sigma_2^2 &= \frac{y'M_1y}{T - K_1} - \frac{y'M_2y}{T - K_2} \\
&= \frac{\beta_2'X_2'M_1X_2\beta_2 + 2u_2'M_1X_2\beta_2 + u_2'M_1u_2}{T - K_1} - \frac{u_2'M_2u_2}{T - K_2}.
\end{aligned} \tag{2.1.9}$$

Therefore

$$E(\tilde\sigma_1^2 - \tilde\sigma_2^2 \mid 2) = \frac{\beta_2'X_2'M_1X_2\beta_2}{T - K_1} \geq 0. \tag{2.1.10}$$

Therefore, in view of the fact that nothing a priori is known about whether
$\tilde\sigma_1^2 - \tilde\sigma_2^2$ is positively or negatively skewed, it seems reasonable to expect that

$$P(y \in S_0 \mid 2) < \tfrac{1}{2}. \tag{2.1.11}$$

For a similar reason it also seems reasonable to expect that

$$P(y \in \bar S_0 \mid 1) < \tfrac{1}{2}. \tag{2.1.12}$$

These inequalities indicate that $S_0$ does offer an intuitive appeal (though a
rather mild one) to the classical statistician who, by principle, is reluctant to
specify the subjective quantities $L_{12}$, $L_{21}$, $P(1)$, and $P(2)$ in the posterior risk
(2.1.5).

As we have seen, Theil's corrected $\bar R^2$ has a certain intuitive appeal and has
been widely used by econometricians as a measure of the goodness of fit.
However, its theoretical justification is not strong, and the experiences of
some researchers have led them to believe that Theil's measure does not
correct sufficiently for the degrees of freedom; that is, it still tends to favor the
equation with more regressors, although not as much as the uncorrected $R^2$
(see, for example, Mayer, 1975). In Section 2.1.5 we shall propose a measure
that corrects more for the degrees of freedom than Theil's $\bar R^2$ does, and in
Section 2.1.6 we shall try to justify the proposed measure from a different
angle.


2.1.5 Prediction Criterion 


A major reason why we would want to consider any measure of the goodness 
of fit is that the better fit the data have in the sample period, the better 
prediction we can expect to get for the prediction period. Therefore it makes 
sense to devise a measure of the goodness of fit that directly reflects the 
efficiency of prediction. For this purpose we shall compare the mean squared 
prediction errors of the predictors derived from the two competing equations 
(2.1.1) and (2.1.2). To evaluate the mean squared prediction error, however, 
we must know the true model; but, if we knew the true model, we would not 
have the problem of choosing a model. We get around this dilemma by 
evaluating the mean squared prediction error of the predictor derived from 
each model, assuming in turn that each model is the true model. This may be 
called the minimini principle (minimizing the minimum risk)— in contrast to 
the more standard minimax principle defined in Section 2.1.2 — because the 
performance of each predictor is evaluated under the most favorable condi- 
tion for it. Although the principle is neither more nor less justifiable than the 
minimax principle, it is adopted here for mathematical convenience. Using 
the unconditional mean squared prediction error given in Eq. (1.6.11) after $\sigma^2$
is replaced with its unbiased estimator, we define the Prediction Criterion
(abbreviated PC) by

$$PC_i = \tilde\sigma_i^2(1 + T^{-1}K_i), \quad i = 1, 2, \tag{2.1.13}$$

for each model, where $\tilde\sigma_i^2 = (T - K_i)^{-1}y'M_iy$ and $K_i$ is the number of
regressors in model $i$.

If we define the modified $R^2$, denoted by $\tilde R^2$, as

$$1 - \tilde R^2 = \frac{T + K}{T - K}(1 - R^2), \tag{2.1.14}$$

choosing the equation with the smallest PC is equivalent to choosing the
equation with the largest $\tilde R^2$. A comparison of (2.1.6) and (2.1.14) shows that
$\tilde R^2$ imposes a higher penalty upon increasing the number of regressors than
does Theil's $\bar R^2$.

Criteria proposed by Mallows (1964) and by Akaike (1973) are similar to
the PC, although they are derived from somewhat different principles. To
define Mallows' Criterion (abbreviated MC), we must first define the matrix $X$
as that matrix consisting of distinct column vectors contained in the union of
$X_1$ and $X_2$ and $K$ as the number of the column vectors of $X$ thus defined. These
definitions of $X$ and $K$ can be generalized to the case where there are more than
two competing equations. Then we have

$$MC_i = \frac{2K_i}{T}\cdot\frac{y'My}{T - K} + \frac{y'M_iy}{T}, \quad i = 1, 2, \tag{2.1.15}$$


where $M = I - X(X'X)^{-1}X'$. Akaike (1973) proposed what he calls the
Akaike Information Criterion (abbreviated AIC) for the purpose of
distinguishing between models more general than regression models (see Section
4.5.2). When the AIC is applied to regression models, it reduces to

$$AIC_i = \log\left(\frac{y'M_iy}{T}\right) + \frac{2K_i}{T}, \quad i = 1, 2. \tag{2.1.16}$$
All three criteria give similar results in common situations (see Amemiya, 
1980a). The MC has one unattractive feature: the matrix X must be specified. 


2.1.6 Optimal Significance Level 


In the preceding sections we have considered the problem of choosing between
equations in which there is no explicit relationship between the competing
regressor matrices $X_1$ and $X_2$. In a special case where one set of regressors
is contained in the other set, the choice of an equation becomes equivalent
to the decision of accepting or rejecting a linear hypothesis on the parameters
of the broader model. Because the acceptance or rejection of a hypothesis
critically depends on the significance level chosen, the problem is that of
determining the optimal significance level (or, equivalently, the optimal critical
value) of the F test according to some criterion. We shall present the gist of
the results obtained in a few representative papers on this topic and then shall
explain the connection between these results and the foregoing discussion on
modifications of $R^2$.
Let the broader of the competing models be

$$y = X\beta + u = X_1\beta_1 + X_2\beta_2 + u, \tag{2.1.17}$$

where $X_1$ and $X_2$ are $T \times K_1$ and $T \times K_2$ matrices, respectively, and for which
we assume $u$ is normal so that model (2.1.17) is the same as Model 1 with
normality. Note that the $X_2$ here has no relationship with the $X_2$ that appears
in Eq. (2.1.2). Suppose we suspect $\beta_2$ might be 0 and test the hypothesis $\beta_2 = 0$
by the F test developed in Section 1.5.2. The appropriate test statistic is
obtained by putting $\beta_2 = 0$ in Eq. (1.5.19) as

$$\eta = \frac{T - K}{K_2}\left(\frac{y'M_1 y}{y'My} - 1\right) \sim F(K_2, T - K). \tag{2.1.18}$$

The researcher first sets the critical value $d$ and then chooses the model
(2.1.17) if $\eta \ge d$ or the constrained model

$$y = X_1\beta_1 + u \tag{2.1.19}$$

if $\eta < d$.

Conventionally, the critical value $d$ is determined rather arbitrarily in such a
way that $P(\eta \ge d)$ evaluated under the null hypothesis equals a preassigned
significance level such as 1 or 5 percent. We shall consider a decision-theoretic
determination of $d$. For that we must first specify the risk function. The
decision of the researcher who chooses between models (2.1.17) and (2.1.19)
on the basis of the F statistic $\eta$ may be interpreted as a decision to estimate $\beta$ by
the estimator $\tilde{\beta}$ defined as

$$\tilde{\beta} = \begin{cases} \hat{\beta} & \text{if } \eta \ge d, \\[4pt] \begin{pmatrix} \hat{\beta}_1 \\ 0 \end{pmatrix} & \text{if } \eta < d, \end{cases} \tag{2.1.20}$$

where $\hat{\beta}$ is the least squares estimator applied to (2.1.17) and $\hat{\beta}_1$ is that applied
to (2.1.19). Thus it seems reasonable to adopt the mean squared error matrix
$\Omega \equiv E(\tilde{\beta} - \beta)(\tilde{\beta} - \beta)'$, where the expectation is taken under (2.1.17), as our
risk (or expected loss) function. However, $\Omega$ is not easy to work with directly
because it depends on many variables and parameters, namely, $X$, $\beta$, $\sigma^2$, $K$, $K_1$,
and $d$, in addition to having the fundamental difficulty of being a matrix. (For
the derivation of $\Omega$, see Sawa and Hiromatsu, 1973, or Farebrother, 1975.)
Thus people have worked with simpler risk functions.

Sawa and Hiromatsu (1973) chose as their risk function the largest characteristic
root of

$$[Q'(X'X)^{-1}Q]^{-1/2}\, Q'\Omega Q\, [Q'(X'X)^{-1}Q]^{-1/2}, \tag{2.1.21}$$

where $Q' = (0, I)$, where 0 is the $K_2 \times K_1$ matrix of zeros and $I$ is the identity
matrix of size $K_2$. This transformation of $\Omega$ lacks a strong theoretical justification
and is used primarily for mathematical convenience. Sawa and Hiromatsu
applied the minimax regret strategy to the risk function (2.1.21) and
showed that in the special case $K_2 = 1$, $d = 1.88$ is optimal for most reasonable
values of $T - K$. Brook (1976) applied the minimax regret strategy to a different
transformation of $\Omega$,

$$\operatorname{tr} X\Omega X', \tag{2.1.22}$$


and recommended d = 2 on the basis of his results. The risk function (2.1.22) 
seems more reasonable than (2.1.21) as it is more closely related to the mean 
squared prediction error (see Section 1.6). At any rate, the conclusions of these 
two articles are similar. 

Now on the basis of these results we can evaluate the criteria discussed in the 
previous subsections by asking what critical value is implied by each criterion 
in a situation where a set of the regressors of one model is contained in that of 
the other model. We must choose between models (2.1.17) and (2.1.19). For 
each criterion, let $\rho$ denote the ratio of the value of the criterion for model
(2.1.19) over that for model (2.1.17). Then, using (2.1.18), we can easily
establish a relationship between $\eta$ and $\rho$. For Theil's criterion we have from
(2.1.6)

$$\eta = \frac{T - K_1}{K - K_1}\,\rho(\text{Theil}) - \frac{T - K}{K - K_1}. \tag{2.1.23}$$

Therefore we obtain the well-known result that Theil's criterion selects
(2.1.19) over (2.1.17) if and only if $\eta < 1$. Thus, compared with the optimal
critical values suggested by Brook or by Sawa and Hiromatsu, Theil's criterion
imposes far less of a penalty upon the inclusion of regressors. From the
prediction criterion (2.1.13) we get

$$\eta = \frac{(T - K_1)(T + K)}{(K - K_1)(T + K_1)}\,\rho(\text{PC}) - \frac{T - K}{K - K_1}. \tag{2.1.24}$$

Therefore

$$\rho(\text{PC}) > 1 \quad \text{if and only if} \quad \eta > \frac{2T}{T + K_1}. \tag{2.1.25}$$
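
The relations (2.1.23) through (2.1.25) follow from a single substitution, sketched here. The shorthand $q = y'M_1y / y'My$ is introduced only for this sketch and is not the text's notation; by (2.1.18), $q = 1 + (K - K_1)\eta/(T - K)$ with $K_2 = K - K_1$. Theil's criterion is taken to be $\hat{\sigma}_i^2 = y'M_iy/(T - K_i)$, the quantity that maximizing $\bar{R}^2$ minimizes.

```latex
\begin{aligned}
\rho(\text{Theil}) &= \frac{y'M_1y/(T-K_1)}{y'My/(T-K)} = \frac{T-K}{T-K_1}\,q
  \;\Longrightarrow\;
  \eta = \frac{T-K_1}{K-K_1}\,\rho(\text{Theil}) - \frac{T-K}{K-K_1},\\[4pt]
\rho(\text{PC}) &= \frac{(T+K_1)(T-K)}{(T-K_1)(T+K)}\,q
  \;\Longrightarrow\;
  \eta = \frac{(T-K_1)(T+K)}{(K-K_1)(T+K_1)}\,\rho(\text{PC}) - \frac{T-K}{K-K_1},\\[4pt]
\rho(\text{PC}) > 1 &\iff
  \eta > \frac{(T-K_1)(T+K) - (T-K)(T+K_1)}{(K-K_1)(T+K_1)} = \frac{2T}{T+K_1}.
\end{aligned}
```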


Table 2.2 gives the values of $2T/(T + K_1)$ for a few selected values of $K_1/T$.
These values are close to the values recommended by Brook and by Sawa and
Hiromatsu. The optimal critical value of the F test implied by the AIC can be
easily computed for various values of $K_1/T$ and $K/T$ from (2.1.16) and
(2.1.18). The critical value for the AIC is very close to that for the PC, although
it is slightly smaller.

Table 2.2 Optimal critical value of the F test implied by PC

    $K_1/T$     $2T/(T + K_1)$
    1/10        1.82
    1/20        1.90
    1/30        1.94

Finally, for the MC we have from (2.1.15)

$$\eta = \frac{T + K}{K - K_1}\,\rho(\text{MC}) - \frac{T - K + 2K_1}{K - K_1}. \tag{2.1.26}$$

Therefore

$$\rho(\text{MC}) > 1 \quad \text{if and only if} \quad \eta > 2. \tag{2.1.27}$$
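
The implied critical values can be tabulated directly. The following is a minimal sketch, not part of the original text: the function names are hypothetical, and the AIC threshold is obtained here by noting that the AIC of the restricted model exceeds that of the broader model exactly when $\log(y'M_1y/y'My) > 2(K - K_1)/T$ and substituting into (2.1.18).

```python
import math


def critical_value_theil() -> float:
    """Theil's criterion prefers the broader model iff eta > 1."""
    return 1.0


def critical_value_pc(T: int, K1: int) -> float:
    """Prediction Criterion threshold 2T / (T + K1), as in Eq. (2.1.25)."""
    return 2.0 * T / (T + K1)


def critical_value_aic(T: int, K: int, K1: int) -> float:
    """AIC threshold: (T - K)/(K - K1) * (exp(2 (K - K1) / T) - 1).

    Derived for this sketch from Eqs. (2.1.16) and (2.1.18).
    """
    K2 = K - K1
    return (T - K) / K2 * (math.exp(2.0 * K2 / T) - 1.0)


def critical_value_mc() -> float:
    """Mallows' Criterion prefers the broader model iff eta > 2."""
    return 2.0


if __name__ == "__main__":
    T = 60  # illustrative sample size
    for K1 in (6, 3, 2):  # K1/T = 1/10, 1/20, 1/30
        K = K1 + 1  # one extra regressor in the broader model
        print(K1, round(critical_value_pc(T, K1), 2),
              round(critical_value_aic(T, K, K1), 4))
```

With $T = 60$ and $K_1 = 6, 3, 2$ the PC column reproduces the entries 1.82, 1.90, and 1.94 of Table 2.2, since the threshold depends only on $K_1/T$.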


These results give some credence to the proposition that the modified $R^2$
proposed here is preferred to Theil's corrected $R^2$ as a measure of the goodness
of fit. However, the reader should take this conclusion with a grain of salt for
several reasons: (1) None of the criteria discussed in the previous subsections is
derived from completely justifiable assumptions. (2) The results in the literature
on the optimal significance level are derived from the somewhat questionable
principle of minimizing the maximum regret. (3) The results in the
literature on the optimal significance level are relevant to a comparison of the 
criteria considered in the earlier subsections only to the extent that one set of 
regressors is contained in the other set. The reader should be reminded again 
that a measure of the goodness of fit is merely one of the many things to be 
considered in the whole process of choosing a regression equation. 


2.2 Ridge Regression and Stein’s Estimator* 
2.2.1 Introduction 


We proved in Section 1.2.5 that the LS estimator is best linear unbiased in 
Model 1 and proved in Section 1.3.3 that it is best unbiased in Model 1 with 
normality. In either case a biased estimator may be better than LS (in the sense 
of having a smaller mean squared error) for some parameter values. In this 
section we shall consider a variety of biased estimators and compare them to 
LS in Model 1 with normality.

The biased estimators we shall consider here are either the constrained least
squares estimator discussed in Section 1.4.1 or the Bayes estimator discussed
in Section 1.4.4 or their variants. If the linear constraints (1.4.1) are true, the
constrained least squares estimator is best linear unbiased. Similarly, the
Bayes estimator has optimal properties if the regression vector $\beta$ is indeed
random and generated according to the prior distribution. In this section,
however, we shall investigate the properties of these estimators assuming that 
the constraints do not necessarily hold. Hence, we have called them biased 
estimators. Even so, it is not at all surprising that such a biased estimator can 
beat the least squares estimator over some region of the parameter space. For 
example, 0 can beat any estimator when the true value of the parameter in 
question is indeed 0. What is surprising is that there exists a biased estimator 
that dominates the least squares estimates over the whole parameter space 
when the risk function is the sum of the mean squared errors, as we shall show. 
Such an estimator was first discovered by Stein (see James and Stein, 1961) 
and has since attracted the attention of many statisticians, some of whom have 
extended Stein’s results in various directions. 

In this section we shall discuss simultaneously two closely related and yet 
separate ideas: One is the aforementioned idea that a biased estimator can 
dominate least squares, for which the main result is Stein’s, and the other is the 
idea of ridge regression originally developed by Hoerl and Kennard (1970a, b) 
to cope with the problem of multicollinearity. Although the two ideas were
initially developed independently of each other, the resulting estimators are 
close cousins; in fact, the term Stein-type estimators and the term ridge esti- 
mators are synonymous and may be used to describe the same class of estima- 
tors. Nevertheless, it is important to recognize them as separate ideas. We 
might be tempted to combine the two ideas by asserting that a biased estimator 
can be good and is especially so if there is multicollinearity. The statement can 
be proved wrong simply by noting that Stein’s original model assumes 
X'X = I, the opposite of multicollinearity. The correct characterization of the 
two ideas is as follows: (1) Some form of constraint is useful in estimation. (2) 
Some form of constraint is necessary if there is multicollinearity. 

The risk function we shall use throughout this section is the scalar

$$E(\hat{\beta} - \beta)'(\hat{\beta} - \beta), \tag{2.2.1}$$

where $\hat{\beta}$ is an estimator in question. This choice of the risk function is as
general as

$$E(\hat{\beta} - \beta)'A(\hat{\beta} - \beta), \tag{2.2.2}$$

where $A$ is an arbitrary (known) positive definite matrix, because we can
always reduce (2.2.2) to (2.2.1) by transforming Model 1 to

$$y = X\beta + u = XA^{-1/2}A^{1/2}\beta + u \tag{2.2.3}$$

and considering the transformed parameter vector $A^{1/2}\beta$. Note, however, that
(2.2.1) is not as general as the mean squared error matrix $E(\hat{\beta} - \beta)(\hat{\beta} - \beta)'$,
which we used in Section 1.2.4, since (2.2.1) is the trace of the mean squared
error matrix.


2.2.2 Canonical Model 


Let $H$ be an orthogonal matrix that diagonalizes the matrix $X'X$, that is,
$H'H = I$ and $H'X'XH = \Lambda$, where $\Lambda$ is the diagonal matrix consisting of the
characteristic roots of $X'X$. Defining $X^* = XH$ and $\alpha = H'\beta$, we can write Eq.
(1.1.4) as

$$y = X^*\alpha + u. \tag{2.2.4}$$

If $\hat{\alpha}$ is the least squares estimator of $\alpha$ in model (2.2.4), we have

$$\hat{\alpha} \sim N(\alpha, \sigma^2\Lambda^{-1}). \tag{2.2.5}$$

Because the least squares estimator is a sufficient statistic for the vector of
regression coefficients, the estimation of $\beta$ in Model 1 with normality is
equivalent to the estimation of $\alpha$ in model (2.2.5). We shall call (2.2.5) the
canonical model; it is simpler to analyze than the original model. Because
$HH' = I$, the risk function $E(\hat{\beta} - \beta)'(\hat{\beta} - \beta)$ in Model 1 is equivalent to the
risk function $E(\hat{\alpha} - \alpha)'(\hat{\alpha} - \alpha)$ in model (2.2.5).


2.2.3 Multicollinearity and Principal Components


In Model 1 we assumed that $X$ is of full rank [that is, rank($X$) $= K \le T$], or,
equivalently, that $X'X$ is nonsingular. If it is not, $X'X$ cannot be inverted and
therefore the least squares estimator cannot be uniquely defined. In other
words, there is no unique solution to the normal equation

$$X'X\beta = X'y. \tag{2.2.6}$$

Even then, however, a subset of the regression parameters still may be estimated
by (1.2.14).

We shall now turn to a more general question: Can $F'\beta$ be estimated by least
squares, where $F$ is an arbitrary $K \times f$ matrix of rank $f$ ($f \le K$)? To make sense
out of this question, we must first define the least squares estimator of $F'\beta$. We
say that the least squares estimator of $F'\beta$ is $F'\hat{\beta}$, where $\hat{\beta}$ is any solution (which
may not be unique) of the normal equation (2.2.6), provided $F'\hat{\beta}$ is unique. If
$F'\hat{\beta}$ is unique, we also say that $F'\beta$ is estimable. Then it is easy to prove that
$F'\beta$ is estimable if and only if we can write $F = X'A$ for some $T \times f$ matrix $A$,
or equivalently, if and only if we can write $F = X'XB$ for some $K \times f$ matrix $B$.
(See Rao, 1973, pp. 223-224, for the proof.) If $F'\beta$ is estimable, it can be
shown that $F'\hat{\beta}$ is the best linear unbiased estimator of $F'\beta$.

The estimability of $F'\beta$ can be reduced to the estimability of a subset of the
regression parameters in the sense of the previous paragraph by the following
observation. Let $G$ be a $K \times (K - f)$ matrix of rank $K - f$ such that $G'F = 0$.
(We defined a similar matrix in Section 1.4.2.) Then we can write Model 1 as

$$\begin{aligned} y &= X\beta + u \\ &= X[F(F'F)^{-1}, G(G'G)^{-1}] \begin{bmatrix} F' \\ G' \end{bmatrix} \beta + u \\ &= [Z_1, Z_2] \begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix} + u, \end{aligned} \tag{2.2.7}$$

where the identity defines $Z_1$, $Z_2$, $\gamma_1$, and $\gamma_2$. Then the estimability of $F'\beta$ is
equivalent to the estimability of $\gamma_1$.

If $X'X$ is singular, $\beta$ is not estimable in the sense defined above (that is, a
solution of Eq. 2.2.6 is not unique). This fact does not mean that we should not
attempt to estimate $\beta$. We can still meaningfully talk about a class of estimators
and study the relative merits of the members of the class. One such class
may be the totality of solutions of (2.2.6), infinite in number. Another class
may be the constrained least squares estimator satisfying linear constraints
$Q'\beta = c$. From Eq. (1.4.11) it is clear that this estimator can be defined even
when $X'X$ is singular. A third class is the class of Bayes estimators with prior
$Q'\beta = c + v$ formed by varying $Q$, $c$, and the distribution of $v$. We should
mention an important member of the first class that also happens to be a
member of the second class. It is called the principal components estimator.

Suppose we arrange the diagonal elements of $\Lambda$ defined in Section 2.2.2 in
descending order, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_K$, and let the corresponding characteristic
vectors be $h_1, h_2, \ldots, h_K$ so that $H = (h_1, h_2, \ldots, h_K)$. Then we
call $Xh_i$ the $i$th principal component of $X$. If $X'X$ is singular, some of its
characteristic roots are 0. Partition

$$\Lambda = \begin{bmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix} \tag{2.2.8}$$

so that the diagonal elements of $\Lambda_1$ are positive and those of $\Lambda_2$ are all 0, and
partition $H = (H_1, H_2)$ conformably. Furthermore, define $X_1^* = XH_1$ and
$X_2^* = XH_2$ and partition $\alpha' = (\alpha_1', \alpha_2')$ conformably. Then $X_2^* = 0$ and hence
$\alpha_2$ cannot be estimated. Suppose we estimate $\alpha_1$ by

$$\hat{\alpha}_1 = (X_1^{*\prime} X_1^*)^{-1} X_1^{*\prime} y \tag{2.2.9}$$

and set $\hat{\alpha}_2 = 0$. (It is arbitrary to choose 0 here; any other constant will do.)
Transforming $\hat{\alpha}' \equiv (\hat{\alpha}_1', \hat{\alpha}_2')$ into an estimator of $\beta$, we obtain the principal
components estimator of $\beta$ by the formula

$$\hat{\beta}_P = H\hat{\alpha} = H_1\Lambda_1^{-1}H_1'X'y. \tag{2.2.10}$$

It is easy to show that $\hat{\beta}_P$ satisfies (2.2.6); hence, it is a member of the first class.
It is also a member of the second class because it is the constrained least
squares subject to $H_2'\beta = 0$.

It was shown by Fomby, Hill, and Johnson (1978) that the principal components
estimator (or constrained least squares subject to $H_2'\beta = 0$) has a smaller
variance-covariance matrix than any constrained least squares estimator obtained
subject to the constraints $Q'\beta = c$, where $Q$ and $c$ can be arbitrary
except that $Q$ has an equal or smaller number of columns than $H_2$.

We shall now consider a situation where $X'X$ is nonsingular but nearly
singular. The near singularity of $X'X$ is commonly referred to as multicollinearity.
Another way of characterizing it is to say that the determinant of $X'X$ is
close to 0 or that the smallest characteristic root of $X'X$ is small. (The question
of how small is "small" will be better understood later.) We now ask the
question, How precisely or imprecisely can we estimate a linear combination
of the regression parameters $c'\beta$ by least squares? Because the matrix $H$ is
nonsingular, we can write $c = Hd$ for some vector $d$. Then we have

$$V(c'\hat{\beta}) = \sigma^2 d'\Lambda^{-1}d, \tag{2.2.11}$$

which gives the answer to the question. In other words, the closer $c$ is to the
direction of the first (last) principal component, the more precisely (imprecisely)
one can estimate $c'\beta$. In particular, we note from (2.2.5) that the precision
of the estimator of an element of $\alpha$ is directly proportional to the corresponding
diagonal element of $\Lambda$.
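
The step leading to (2.2.11) can be spelled out as follows. This is a sketch that uses only the least squares variance $V(\hat{\beta}) = \sigma^2(X'X)^{-1}$ from Model 1, the decomposition $X'X = H\Lambda H'$ with $H'H = I$ from Section 2.2.2, and $c = Hd$; the componentwise sum in the last line is added here for illustration.

```latex
\begin{aligned}
V(c'\hat{\beta}) &= \sigma^2\, c'(X'X)^{-1}c \\
  &= \sigma^2\, d'H'(H\Lambda H')^{-1}Hd \\
  &= \sigma^2\, d'H'H\Lambda^{-1}H'Hd \\
  &= \sigma^2\, d'\Lambda^{-1}d \;=\; \sigma^2 \sum_{i=1}^{K} \frac{d_i^2}{\lambda_i}.
\end{aligned}
```

The sum makes the conclusion visible: weight $d_i$ placed on a direction with a small root $\lambda_i$ inflates the variance, while weight on a direction with a large root contributes little.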

Suppose we partition $\Lambda$ as in (2.2.8) but this time include all the "large"
elements in $\Lambda_1$ and "small" elements in $\Lambda_2$. The consideration of which roots
to include in $\Lambda_2$ should depend on a subjective evaluation of the statistician
regarding the magnitude of the variance to be tolerated. It makes sense to use
the principal components estimator (2.2.10) also in the present situation,
because $\alpha_2$ can be only imprecisely estimated. (Here, the principal components
estimator is not unique because the choice of $\alpha_2$ is a matter of subjective
judgment. Therefore it is more precise to call this estimator by a name such as
the $K_1$ principal components estimator specifying the number of elements
chosen in $\alpha_1$.)


2.2.4 Ridge Regression 
Hoerl and Kennard (1970a, b) proposed the class of estimators defined by

$$\hat{\beta}(\gamma) = (X'X + \gamma I)^{-1}X'y, \tag{2.2.12}$$

called the ridge estimators. Hoerl and Kennard chose these estimators because
they hoped to alleviate the instability of the least squares estimator due
to the near singularity of $X'X$ by adding a positive scalar $\gamma$ to the characteristic
roots of $X'X$. They proved that given $\beta$ there exists $\gamma^*$, which depends upon $\beta$,
such that $E[\hat{\beta}(\gamma^*) - \beta]'[\hat{\beta}(\gamma^*) - \beta] < E(\hat{\beta} - \beta)'(\hat{\beta} - \beta)$, where $\hat{\beta} = \hat{\beta}(0)$.
Because $\gamma^*$ depends on $\beta$, $\hat{\beta}(\gamma^*)$ is not a practical estimator. But the existence of
$\hat{\beta}(\gamma^*)$ gives rise to a hope that one can determine $\gamma$, either as a constant or as a
function of the sample, in such a way that $\hat{\beta}(\gamma)$ is better than the least squares
estimator $\hat{\beta}$ with respect to the risk function (2.2.2) over a wide range of the
parameter space.

Hoerl and Kennard proposed the ridge trace method to determine the value
of $\gamma$. The ridge trace is a graph of $\hat{\beta}_i(\gamma)$, the $i$th element of $\hat{\beta}(\gamma)$, drawn as a
function of $\gamma$. They proposed that $\gamma$ be determined as the smallest value at
which the ridge trace stabilizes. The method suffers from two weaknesses: (1)
The point at which the ridge trace starts to stabilize cannot always be determined
objectively. (2) The method lacks theoretical justification inasmuch as
its major justification is derived from certain Monte Carlo studies, which,
though favorable, are not conclusive. Although several variations of the ridge
trace method and many analogous procedures to determine $\gamma$ have been
proposed, we shall discuss only the empirical Bayes method, which seems to
be the only method based on theoretical grounds. We shall present a variant of
it in the next paragraph and more in the next two subsections.

Several authors interpreted the ridge estimator (more precisely, the class of
estimators) as the Bayes estimator and proposed the empirical Bayes method
of determining $\gamma$; we shall follow the discussion of Sclove (1973). Suppose
that the prior distribution of $\beta$ is $N(\mu, \sigma_\beta^2 I)$, distributed independently of $u$.
Then from (1.4.22) the Bayes estimator of $\beta$ is given by

$$\beta^* = (X'X + \gamma I)^{-1}(X'y + \gamma\mu), \tag{2.2.13}$$

where $\gamma = \sigma^2/\sigma_\beta^2$. Therefore, the ridge estimator (2.2.12) is obtained by putting
$\mu = 0$ in (2.2.13). By the empirical Bayes method we mean the estimation of
the parameters (in our case, $\sigma_\beta^2$) of the prior distribution using the sample
observations. The empirical Bayes method may be regarded as a compromise
between the Bayesian and the classical analysis. From the marginal distribution
(that is, not conditional on $\beta$) of $y$, we have

$$Ey'y = \sigma_\beta^2 \operatorname{tr} X'X + T\sigma^2, \tag{2.2.14}$$

which suggests that we can estimate $\sigma_\beta^2$ by

$$\hat{\sigma}_\beta^2 = \frac{y'y - T\hat{\sigma}^2}{\operatorname{tr} X'X}, \tag{2.2.15}$$

where $\hat{\sigma}^2 = T^{-1}y'[I - X(X'X)^{-1}X']y$ as usual. Finally, we can estimate $\gamma$ by
$\hat{\gamma} = \hat{\sigma}^2/\hat{\sigma}_\beta^2$.
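
The empirical Bayes recipe above can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: the prior mean is taken to be $\mu = 0$, the helper names (`solve`, `ridge`, `empirical_bayes_gamma`) are hypothetical, and the small Gaussian-elimination solver is included only to keep the example free of external libraries.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def solve(A: Matrix, b: Vector) -> Vector:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge(X: Matrix, y: Vector, gamma: float) -> Vector:
    """Ridge estimator (X'X + gamma I)^{-1} X'y; gamma = 0 gives least squares."""
    K = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(K)] for i in range(K)]
    Xty = [sum(row[i] * yt for row, yt in zip(X, y)) for i in range(K)]
    A = [[XtX[i][j] + (gamma if i == j else 0.0) for j in range(K)] for i in range(K)]
    return solve(A, Xty)


def empirical_bayes_gamma(X: Matrix, y: Vector) -> float:
    """gamma_hat = sigma2_hat / sigma2_beta_hat, following (2.2.14)-(2.2.15)."""
    T = len(X)
    beta_ls = ridge(X, y, 0.0)
    fitted = [sum(x * b for x, b in zip(row, beta_ls)) for row in X]
    sigma2_hat = sum((yt - f) ** 2 for yt, f in zip(y, fitted)) / T
    yty = sum(yt * yt for yt in y)
    trace_XtX = sum(x * x for row in X for x in row)  # tr X'X
    # Undefined (division by zero) only if the least squares fit is identically 0.
    sigma2_beta_hat = (yty - T * sigma2_hat) / trace_XtX
    return sigma2_hat / sigma2_beta_hat


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]  # made-up design
    y = [1.0, 2.0, 2.5, -0.5]                              # made-up data
    g = empirical_bayes_gamma(X, y)
    print(g, ridge(X, y, 0.0), ridge(X, y, g))
```

Because $y'y - T\hat{\sigma}^2$ equals the explained sum of squares, $\hat{\sigma}_\beta^2$ is nonnegative, and the resulting $\hat{\gamma}$ is small when the regression fits well relative to the noise.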

In the next two subsections we shall discuss many more varieties of ridge
estimators and what we call generalized ridge estimators, some of which
involve the empirical Bayes method of determining $\gamma$. The canonical model
presented in Section 2.2.2 will be considered.


2.2.5 Stein's Estimator: Homoscedastic Case 


Let us consider a special case of the canonical model (2.2.5) in which $\Lambda = I$.
James and Stein (1961) showed that $E\|[1 - c(\hat{\alpha}'\hat{\alpha})^{-1}]\hat{\alpha} - \alpha\|^2$ is minimized
for all $\alpha$ when $c = (K - 2)\sigma^2$ if $\sigma^2$ is known and $K \ge 3$, where $\|x\|^2$ denotes the
vector product $x'x$. If we define

$$\text{Stein's estimator} \quad \hat{\alpha}^* = \left[1 - \frac{(K - 2)\sigma^2}{\hat{\alpha}'\hat{\alpha}}\right]\hat{\alpha},$$

the result of James and Stein implies in particular that Stein's estimator is
uniformly better than the maximum likelihood estimator $\hat{\alpha}$ with respect to the
risk function $E(\hat{\alpha} - \alpha)'(\hat{\alpha} - \alpha)$ if $K \ge 3$. In other words, $\hat{\alpha}$ is inadmissible (see
the definition of inadmissible in Section 2.1.2). The fact that $\hat{\alpha}$ is minimax
with a constant risk (see Hodges and Lehmann, 1950) implies that $\hat{\alpha}^*$ is minimax.
This surprising result has had a great impact on the theory of statistics.
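
A small simulation makes the dominance result concrete. The sketch below is illustrative only: it assumes the canonical model with $\Lambda = I$ and known $\sigma^2$, and the function names, the number of replications, and the seed are arbitrary choices made here. It estimates the risk $E\|\cdot - \alpha\|^2$ of the maximum likelihood estimator $\hat{\alpha}$ and of Stein's estimator $[1 - (K - 2)\sigma^2/\hat{\alpha}'\hat{\alpha}]\hat{\alpha}$ from the same draws.

```python
import random
from typing import List, Tuple


def stein(alpha_hat: List[float], sigma2: float) -> List[float]:
    """Stein's estimator [1 - (K - 2) sigma2 / (a'a)] a for K >= 3."""
    K = len(alpha_hat)
    norm2 = sum(a * a for a in alpha_hat)  # zero with probability 0
    shrink = 1.0 - (K - 2) * sigma2 / norm2
    return [shrink * a for a in alpha_hat]


def squared_error(estimate: List[float], alpha: List[float]) -> float:
    return sum((e - a) ** 2 for e, a in zip(estimate, alpha))


def simulate_risks(alpha: List[float], sigma2: float = 1.0,
                   reps: int = 5000, seed: int = 0) -> Tuple[float, float]:
    """Monte Carlo risks (ML, Stein) when alpha_hat ~ N(alpha, sigma2 I)."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    loss_ml = 0.0
    loss_stein = 0.0
    for _ in range(reps):
        alpha_hat = [rng.gauss(a, sd) for a in alpha]
        loss_ml += squared_error(alpha_hat, alpha)
        loss_stein += squared_error(stein(alpha_hat, sigma2), alpha)
    return loss_ml / reps, loss_stein / reps


if __name__ == "__main__":
    for alpha in ([0.0] * 10, [2.0] * 10):
        print(alpha[0], simulate_risks(alpha))
```

The risk of $\hat{\alpha}$ is $K\sigma^2$ whatever $\alpha$ is, so the first number should stay near 10 in both runs; the gain from shrinking is largest at $\alpha = 0$ and fades as $\alpha'\alpha$ grows.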

Translated into the regression model, the result of James and Stein implies
two facts: (1) Consider the ridge estimator $(X'X + \gamma I)^{-1}X'y$ with
$\gamma = (K - 2)\sigma^2/[\hat{\alpha}'\hat{\alpha} - (K - 2)\sigma^2]$. If $X'X = I$, it is reduced precisely to Stein's
estimator of $\beta$ since $X'y \sim N(\beta, \sigma^2 I)$. Therefore this ridge estimator is uniformly
better than the least squares estimator if $X'X = I$ (the opposite of
multicollinearity). (2) Assume a general $X'X$ in Model 1. If we define
$A = (X'X)^{1/2}$, we have $A\hat{\beta} \sim N(A\beta, \sigma^2 I)$, where $\hat{\beta}$ is the least squares estimator.
Applying Stein's estimator to $A\hat{\beta}$, we know $E\|(1 - B)A\hat{\beta} - A\beta\|^2 < E\|A(\hat{\beta} - \beta)\|^2$
for all $\beta$ where $B = (K - 2)\sigma^2/\hat{\beta}'X'X\hat{\beta}$. Therefore, equivalently,
$(1 - B)\hat{\beta}$ is uniformly better than $\hat{\beta}$ with respect to the risk function
$E(\hat{\beta} - \beta)'X'X(\hat{\beta} - \beta)$. Note that this is essentially the risk function we proposed
in (1.6.10) in Section 1.6, where we discussed prediction.
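
Fact (1) can be checked in one line. The sketch assumes $X'X = I$, so that $\hat{\alpha} = X'y$ is the least squares estimator and the ridge estimator is $(1 + \gamma)^{-1}X'y$; substituting $\gamma = (K - 2)\sigma^2/[\hat{\alpha}'\hat{\alpha} - (K - 2)\sigma^2]$ gives

```latex
\frac{1}{1+\gamma}
  = \frac{\hat{\alpha}'\hat{\alpha} - (K-2)\sigma^2}
         {\hat{\alpha}'\hat{\alpha} - (K-2)\sigma^2 + (K-2)\sigma^2}
  = 1 - \frac{(K-2)\sigma^2}{\hat{\alpha}'\hat{\alpha}},
\qquad\text{so}\qquad
(X'X + \gamma I)^{-1}X'y
  = \left[1 - \frac{(K-2)\sigma^2}{\hat{\alpha}'\hat{\alpha}}\right]\hat{\alpha}.
```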

So far, we have assumed $\sigma^2$ is known. James and Stein showed that even
when $\sigma^2$ is unknown, if $S$ is distributed independent of $\hat{\alpha}$ and as $\sigma^2\chi_n^2$, then
$E\|[1 - cS(\hat{\alpha}'\hat{\alpha})^{-1}]\hat{\alpha} - \alpha\|^2$ attains the minimum for all $\alpha$ and $\sigma^2$ at
$c = (K - 2)/(n + 2)$ if $K \ge 3$. They also showed that $[1 - cS(\hat{\alpha}'\hat{\alpha})^{-1}]\hat{\alpha}$ is uniformly
better than $\hat{\alpha}$ if $0 < c < (K - 2)/(n + 2)$. In the regression model we
can put $S = y'[I - X(X'X)^{-1}X']y$ because it is independent of $\hat{\beta}$ and distributed
as $\sigma^2\chi_{T-K}^2$.

Efron and Morris (1972) interpreted Stein's estimator as an empirical Bayes
estimator. Suppose $\hat{\alpha} \sim N(\alpha, \sigma^2 I)$, where $\sigma^2$ is known, and the prior distribution
of $\alpha$ is $N(0, \sigma^2\gamma^{-1}I)$. Then the Bayes estimator is $\hat{\alpha}^* = (1 + \gamma)^{-1}\hat{\alpha} = (1 - B)\hat{\alpha}$
where $B = \gamma/(1 + \gamma)$. The marginal distribution of $\hat{\alpha}$ is $N(0, \sigma^2 B^{-1}I)$.
Therefore $B\hat{\alpha}'\hat{\alpha}/\sigma^2 \sim \chi_K^2$. Because $E[(\chi_K^2)^{-1}] = (K - 2)^{-1}$ (see Johnson and
Kotz, 1970a, vol. 1, p. 166), we have

$$E\left[\frac{(K - 2)\sigma^2}{\hat{\alpha}'\hat{\alpha}}\right] = B.$$

Thus we can use the term within the square bracket as an unbiased estimator
of $B$, thereby leading to Stein's estimator.

It is important not to confuse Stein's result with the statement that
$E(\hat{\alpha}^* - \alpha)(\hat{\alpha}^* - \alpha)' < E(\hat{\alpha} - \alpha)(\hat{\alpha} - \alpha)'$ in the matrix sense. This inequality
does not generally hold. Note that Stein's estimator shrinks each component
of $\hat{\alpha}$ by the same factor $B$. If the amount of shrinkage for a particular component
is large, the mean squared error of Stein's estimator for that component
may well exceed that of the corresponding component of $\hat{\alpha}$, even though
$E\|\hat{\alpha}^* - \alpha\|^2 < E\|\hat{\alpha} - \alpha\|^2$. In view of this possibility, Efron and Morris
(1972) proposed a compromise: Limit the amount of shrinkage to a fixed
amount for each component. In this way the maximum possible mean
squared error for the components of $\alpha$ can be decreased, whereas, with luck,
the sum of the mean squared errors will not be increased by very much.

Earlier we stated that the maximum likelihood estimator $\hat{\alpha}$ is inadmissible
because it is dominated by Stein's estimator. A curious fact is that Stein's
estimator itself is dominated by

$$\text{Stein's positive-rule estimator} \quad \left[1 - \frac{(K - 2)\sigma^2}{\hat{\alpha}'\hat{\alpha}}\right]_+ \hat{\alpha},$$

where $[x]_+$ denotes $\max[0, x]$. Hence Stein's estimator is inadmissible, as
proved by Baranchik (1970). Efron and Morris (1973) showed that Stein's
positive-rule estimator is also inadmissible but cannot be greatly improved
upon.

We defined Stein's estimator as the estimator obtained by shrinking $\hat{\alpha}$
toward 0. Stein's estimator can easily be modified in such a way that it shrinks
$\hat{\alpha}$ toward any other value. It is easy to show that

$$\text{Stein's modified estimator} \quad \left[1 - \frac{(K - 2)\sigma^2}{(\hat{\alpha} - c)'(\hat{\alpha} - c)}\right](\hat{\alpha} - c) + c$$

is minimax for any constant vector $c$. If the stochastic quantity $K^{-1}ll'\hat{\alpha}$ is
chosen to be $c$, where $l$ is the vector of ones, then the resulting estimator can be
shown to be minimax for $K \ge 4$.


2.2.6 Stein's Estimator: Heteroscedastic Case 


Assume model (2.2.5), where $\Lambda$ is a general positive definite diagonal matrix.
Two estimators for this case can be defined.

Ridge estimator: $\hat{\alpha}^* = (\Lambda + \gamma I)^{-1}\Lambda\hat{\alpha}$,

$$\hat{\alpha}_i^* = (1 - B_i)\hat{\alpha}_i, \quad \text{where } B_i = \frac{\gamma}{\lambda_i + \gamma}.$$

(Note: The transformation $\alpha = H'\beta$ translates this estimator into the ridge
estimator (2.2.12). $\gamma$ is either a constant or a function of the sample.)

Generalized ridge estimator: $\hat{\alpha}^* = (\Lambda + \Gamma)^{-1}\Lambda\hat{\alpha}$, where $\Gamma$ is
diagonal,

$$\hat{\alpha}_i^* = (1 - B_i)\hat{\alpha}_i, \quad \text{where } B_i = \frac{\gamma_i}{\lambda_i + \gamma_i},$$

$$\beta^* = (X'X + H\Gamma H')^{-1}X'y.$$

Other ridge and generalized ridge estimators have been proposed by various
authors. In the three following ridge estimators, $\gamma$ is a positive quantity that
does not depend on $\lambda_i$; therefore $B_i$ is inversely proportional to $\lambda_i$. This is an
intuitively appealing property because it seems reasonable to shrink the component
with the larger variance more. In the four following generalized ridge
estimators, exactly the opposite takes place: The amount of shrinkage $B_i$ is an
increasing function of $\lambda_i$, an undesirable property. In some of the estimators,
$\sigma^2$ appears in the formula, and in some, its estimate $\hat{\sigma}^2$, which is assumed to be
independent of $\hat{\alpha}$, appears. As pointed out by Efron and Morris (1976), the
fundamental properties of Stein's estimators are not changed if $\sigma^2$ is independently
estimated.
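
The two definitions differ only in whether the ridge constant varies by component. A minimal sketch in the canonical model, with hypothetical function names and made-up numbers, computes the shrinkage factors $B_i = \gamma/(\lambda_i + \gamma)$ and $B_i = \gamma_i/(\lambda_i + \gamma_i)$ and applies $\hat{\alpha}_i^* = (1 - B_i)\hat{\alpha}_i$:

```python
from typing import List


def ridge_shrinkage(lams: List[float], gamma: float) -> List[float]:
    """B_i = gamma / (lambda_i + gamma) with a common ridge constant gamma."""
    return [gamma / (lam + gamma) for lam in lams]


def generalized_ridge_shrinkage(lams: List[float],
                                gammas: List[float]) -> List[float]:
    """B_i = gamma_i / (lambda_i + gamma_i), one constant per component."""
    return [g / (lam + g) for lam, g in zip(lams, gammas)]


def shrink(alpha_hat: List[float], B: List[float]) -> List[float]:
    """alpha*_i = (1 - B_i) * alpha_hat_i."""
    return [(1.0 - b) * a for a, b in zip(alpha_hat, B)]


if __name__ == "__main__":
    lams = [10.0, 1.0, 0.1]    # characteristic roots (illustrative)
    alpha_hat = [1.0, 1.0, 1.0]
    B = ridge_shrinkage(lams, gamma=0.5)
    print(B, shrink(alpha_hat, B))
```

With a common $\gamma$ the component with the smallest root, and hence the largest variance $\sigma^2/\lambda_i$, is shrunk the most, which is the property the paragraph above calls intuitively appealing.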


Selected Estimators and Their Properties 


Ridge Estimators 
Ridge 1 (Sclove, 1973)

$$\gamma = \frac{\hat{\sigma}^2 \operatorname{tr} \Lambda}{\hat{\alpha}'\Lambda\hat{\alpha}}.$$

Ridge 2 (Hoerl, Kennard, and Baldwin, 1975) and Modified Ridge 2
(Thisted, 1976)

$$\gamma = \frac{K\hat{\sigma}^2}{\hat{\alpha}'\hat{\alpha}}.$$

This estimator is obtained by putting $\Lambda = I$ in Sclove's estimator. Although
the authors claimed its good properties on the basis of a Monte Carlo study,
Thisted (1976) showed that it can sometimes be far inferior to the maximum
likelihood estimator $\hat{\alpha}$; he proposed a modification, $\gamma = (K - 2)\sigma^2/\hat{\alpha}'\hat{\alpha}$, and
showed that the modified estimator is minimax for some $\Lambda$ if $\sigma^2$ is known.


Ridge 3 (Thisted, 1976) 


y- ma if all d; < © 
Y, diá? 
imj 
=0 otherwise, 
where 
d= Ar An 




This estimator is minimax for all $\Lambda$ if $\sigma^2$ is known. If $\lambda_i$ are constant, this
estimator is reduced to the modified version of Ridge 2. When the $\lambda$'s are too
spread out, however, it becomes indistinguishable from $\hat{\alpha}$ (which is minimax).


Generalized Ridge Estimators 

Generalized Ridge 1 (Berger, 1975)

$$B_i = \frac{(K - 2)\sigma^2\lambda_i}{\hat{\alpha}'\Lambda^2\hat{\alpha}}.$$

This estimator is minimax for all $\Lambda$ and reduces to Stein's estimator when $\lambda_i$
are constant.

Generalized Ridge 2 (Berger, 1976) 
_ f(&' NÂ) 

ava ` 


Berger (1976) obtained conditions on $f$ under which it is minimax and admissible
for all $\Lambda$.


Generalized Ridge 3 (Bhattacharya, 1966). $B_i$ is complicated and therefore
is not reproduced here, but it is an increasing function of $\lambda_i$ and is minimax for
all $\Lambda$.


Generalized Ridge 4 (Strawderman, 1978) 


B = Ë o 
i a’ AG + g6? + h + af 


B, 


where 8? — g?y?. 


This estimator is minimax for all A if 
1 [2(K —-2) | 
gas — |225 
0Sas hen | m +2 
2K 


Results 


All the generalized ridge estimators are minimax for all $\Lambda$ and generalized
ridge 2 is also admissible, whereas among the ridge estimators only Thisted's
(which is strictly not ridge because of a discontinuity) is minimax for all $\Lambda$.

Because $\hat{\alpha}$ is minimax with a constant risk, any other minimax estimator
dominates $\hat{\alpha}$. However, the mere fact that an estimator dominates $\hat{\alpha}$ does not
necessarily make the estimator good in its own right. If the estimator is admissible
as well, like Berger's generalized ridge 2, there is no other estimator that
dominates it. Even that, however, is no guarantee of excellence because there
may be an estimator (which may be neither minimax nor admissible) that has
a lower risk over a wide range of the parameter space. It is nice to prove
minimaxity and admissibility; however, we should look for other criteria
of performance as well, such as whether the amount of shrinkage is proportional
to the variance, the criterion in which all the generalized ridge estimators
fail.

The exact distributions of Stein’s or ridge estimators are generally hard to 
obtain. However, in many situations they may be well approximated by the 
jackknife and the bootstrap methods (see Section 4.3.4). 


2.2.7 Monte Carlo and Applications 


Thisted (1976) compared ridge 2, modified ridge 2, ridge 3, and generalized
ridge 1 by the Monte Carlo method and found the somewhat paradoxical
result that ridge 2, which is minimax for the smallest subset of $\Lambda$, performs best
in general.

Gunst and Mason (1977) compared by the Monte Carlo method the esti- 
mators (1) least squares, (2) principal components, (3) Stein's, and (4) ridge 2. 
Their conclusion was that although (3) and (4) are frequently better than (1) 
and (2), the improvement is not large enough to offset the advantages of (1) 
and (2), namely, the known distribution and the ability to select regressors. 

Dempster, Schatzoff, and Wermuth (1977) compared 57 estimators, belonging
to groups such as selection of regressors, principal components,
Stein's and ridge, in 160 normal linear models with factorial designs using
both $E(\hat{\beta} - \beta)'(\hat{\beta} - \beta)$ and $E(\hat{\beta} - \beta)'X'X(\hat{\beta} - \beta)$ as the risk function. The
winner was their version of ridge based on the empirical Bayes estimation of $\gamma$
defined by


$$\sum_{i=1}^{K} \frac{\hat{\alpha}_i^2}{\lambda_i^{-1} + \gamma^{-1}} = K\hat{\sigma}^2.$$


The fact that their ridge beat Stein's estimator even with respect to the risk
function $E(\hat{\beta} - \beta)'X'X(\hat{\beta} - \beta)$ casts some doubt on their design of Monte
Carlo experiments, as pointed out by Efron, Morris, and Thisted in the discussion
following the article.

For an application of Stein's estimator (pulling toward the overall mean),
see Efron and Morris (1975), who considered two problems, one of which is
the prediction of the end-of-season batting average from the averages of the
first forty at-bats. For applications of ridge estimators to the estimation of
production functions, see Brown and Beattie (1975) and Vinod (1976). These
authors determined $\gamma$ by modifications of the Hoerl and Kennard ridge trace
analysis. A ridge estimator with a constant $\gamma$ is used by Brown and Payne
(1975) in the study of election night forecasts. Aigner and Judge (1977) used
generalized ridge estimators on economic data (see Section 2.2.8).


2.2.8 Stein's Estimator versus Pre-Test Estimators 


Let $\hat\alpha \sim N(\alpha, \sigma^2 I)$ and $S \sim \sigma^2\chi^2_n$ (independent of $\hat\alpha$). Consider the strategy:
Test the hypothesis $\alpha = 0$ by the F test and estimate $\alpha$ by 0 if the hypothesis is
accepted (that is, if $S^{-1}\hat\alpha'\hat\alpha \le d$ for an appropriate $d$) and estimate $\alpha$ by $\hat\alpha$ if the
hypothesis is rejected. This procedure amounts to estimating $\alpha$ by the estimator
$I_d\hat\alpha$, where $I_d$ is the indicator function such that $I_d = 1$ if $S^{-1}\hat\alpha'\hat\alpha > d$ and 0
otherwise. Such an estimator is called a preliminary-test estimator, or a pre-test
estimator for short. Sclove, Morris, and Radhakrishnan (1972) proved
that $I_d\hat\alpha$ is dominated by Stein's positive-rule estimator $[1 - (d_0 S/\hat\alpha'\hat\alpha)]_+\hat\alpha$ for
some $d_0$ such that $d < d_0 < (K - 2)/(n + 2)$.
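
The two rules just described can be written out in a few lines. The following is only a sketch under our own naming: the functions `pretest_estimate` and `stein_positive_rule` are hypothetical helpers, vectors are plain Python lists, and the constants `d` and `d0` are supplied by the caller rather than derived from the bound above.

```python
def pretest_estimate(alpha_hat, S, d):
    """Pre-test estimator I_d * alpha_hat.

    Returns alpha_hat unchanged if S^{-1} alpha_hat'alpha_hat > d,
    and the zero vector otherwise.
    """
    stat = sum(a * a for a in alpha_hat) / S
    keep = 1.0 if stat > d else 0.0
    return [keep * a for a in alpha_hat]


def stein_positive_rule(alpha_hat, S, d0):
    """Stein's positive-rule estimator [1 - d0*S/(alpha_hat'alpha_hat)]_+ alpha_hat."""
    ss = sum(a * a for a in alpha_hat)
    if ss == 0.0:
        return [0.0 for _ in alpha_hat]  # nothing to shrink
    factor = max(0.0, 1.0 - d0 * S / ss)  # the positive part [.]_+
    return [factor * a for a in alpha_hat]
```

The pre-test rule is all-or-nothing, whereas the positive-rule estimator shrinks continuously and reaches zero only when the statistic is small enough; this is the sense in which the latter smooths the former.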

A pre-test estimator is commonly used in the regression model. Often the
linear hypothesis $Q'\beta = c$ is tested, and the constrained least squares estimator
(1.4.11) is used if the F statistic (1.5.12) is smaller than a prescribed value
and the least squares estimator is used otherwise. An example of this was 
considered in Eq. (2.1.20). The result of Sclove, Morris, and Radhakrishnan 
can be extended to this regression situation in the following way.? 

Let $R$ be as defined in Section 1.4.2 and let $A = (Q, R)'$. Define $Z = XA^{-1}$,
$y^* = y - Xf$, and $\gamma = A(\beta - f)$ where $f$ is any vector satisfying $c = Q'f$. Then
we can write Model 1 as

    $y^* = Z\gamma + u.$    (2.2.16)

Partition $Z$ and $\gamma$ conformably as $Z = (Z_1, Z_2)$, where $Z_1 = XQ(Q'Q)^{-1}$ and
$Z_2 = XR(R'R)^{-1}$, and as $\gamma' = (\gamma_1', \gamma_2')$, where $\gamma_1 = Q'(\beta - f)$ and $\gamma_2 =
R'(\beta - f)$. Then the hypothesis $Q'\beta = c$ in the model $y = X\beta + u$ is equivalent
to the hypothesis $\gamma_1 = 0$ in the model $y^* = Z\gamma + u$.

Define $W = (W_0, W_1, W_2)$ as $W_1 = \bar Z_1(\bar Z_1'\bar Z_1)^{-1/2}$; $W_2 = Z_2(Z_2'Z_2)^{-1/2}$,
where $\bar Z_1 = [I - Z_2(Z_2'Z_2)^{-1}Z_2']Z_1$; and $W_0$ is a matrix satisfying $W_0'W_1 = 0$,
$W_0'W_2 = 0$, and $W_0'W_0 = I$. Then we have

    $W'y^* \sim N(W'Z\gamma, \sigma^2 I),$    (2.2.17)

and the hypothesis $\gamma_1 = 0$ in model (2.2.16) is equivalent to the hypothesis
$W_1'Z\gamma = 0$ in model (2.2.17). Define $I_d = 1$ if $S^{-1}y^{*\prime}\bar Z_1(\bar Z_1'\bar Z_1)^{-1}\bar Z_1'y^* > d$ and
0 otherwise and define $D = (1 - [d_0 S/y^{*\prime}\bar Z_1(\bar Z_1'\bar Z_1)^{-1}\bar Z_1'y^*])_+$. Then the
aforementioned result of Sclove et al. implies that

    $D\begin{bmatrix} 0 \\ W_1'y^* \\ W_2'y^* \end{bmatrix} + (1 - D)\begin{bmatrix} 0 \\ 0 \\ W_2'y^* \end{bmatrix}$

dominates

    $I_d\begin{bmatrix} 0 \\ W_1'y^* \\ W_2'y^* \end{bmatrix} + (1 - I_d)\begin{bmatrix} 0 \\ 0 \\ W_2'y^* \end{bmatrix}$

in the estimation of $W'Z\gamma$. Therefore, premultiplying by $W$ (which is the
inverse of $W'$), we see that

    $DZ(Z'Z)^{-1}Z'y^* + (1 - D)Z_2(Z_2'Z_2)^{-1}Z_2'y^* \equiv Z\hat\gamma$    (2.2.18)

dominates

    $I_d Z(Z'Z)^{-1}Z'y^* + (1 - I_d)Z_2(Z_2'Z_2)^{-1}Z_2'y^* \equiv Z\tilde\gamma$    (2.2.19)

in the estimation of $Z\gamma$. Finally, we conclude that Stein's positive-rule estimator
$\hat\gamma$ defined by (2.2.18) dominates the pre-test estimator $\tilde\gamma$ defined by (2.2.19)
in the sense that $E(\hat\gamma - \gamma)'Z'Z(\hat\gamma - \gamma) \le E(\tilde\gamma - \gamma)'Z'Z(\tilde\gamma - \gamma)$ for all $\gamma$.

Although the preceding conclusion is the only known result that shows the 
dominance of a Stein-type estimator over a pre-test estimator (with respect to 
a particular risk function), any Stein-type or ridge-type estimator presented in 
the previous subsections can be modified in such a way that "shrinking" or
"pulling" is done toward linear constraints $Q'\beta = c$.

We can assume $Q'Q = I$ without loss of generality because if $Q'Q \neq I$, we
can define $Q^* = Q(Q'Q)^{-1/2}$ and $c^* = (Q'Q)^{-1/2}c$ so that $Q^{*\prime}\beta = c^*$ and
$Q^{*\prime}Q^* = I$. Denoting the least squares estimator of $\beta$ by $\hat\beta$, we have

    $Q'\hat\beta - c \sim N[Q'\beta - c, \sigma^2 Q'(X'X)^{-1}Q].$    (2.2.20)

Defining the matrix $G$ such that $G'G = I$ and $G'Q'(X'X)^{-1}QG = \Lambda^{-1}$, with $\Lambda$
diagonal, we have

    $G'(Q'\hat\beta - c) \sim N[G'(Q'\beta - c), \sigma^2\Lambda^{-1}].$    (2.2.21)
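Therefore, if $B$ is the diagonal matrix with the $i$th diagonal element $B_i$ defined
under any of the minimax estimators presented in Section 2.2.6, we have

    $(I - B)G'(Q'\hat\beta - c) \ll G'(Q'\hat\beta - c)$ for $G'(Q'\beta - c)$,    (2.2.22)

where "$\hat\theta_1 \ll \hat\theta_2$ for $\theta$" means that

    $E(\hat\theta_1 - \theta)'(\hat\theta_1 - \theta) \le E(\hat\theta_2 - \theta)'(\hat\theta_2 - \theta)$

for all $\theta$ with strict inequality holding for at least one value of $\theta$. Therefore we
have

    $GDG'(Q'\hat\beta - c) \ll Q'\hat\beta - c$ for $Q'\beta - c$,    (2.2.23)

where $D = I - B$, and consequently

    $\begin{bmatrix} GDG'(Q'\hat\beta - c) + c \\ R'\hat\beta \end{bmatrix} \ll \begin{bmatrix} Q'\hat\beta \\ R'\hat\beta \end{bmatrix}$ for $\begin{bmatrix} Q'\beta \\ R'\beta \end{bmatrix}$,    (2.2.24)

where $R$ is as defined in Section 1.4.2 with the added condition $R'R = I$. Then

    $\begin{bmatrix} Q' \\ R' \end{bmatrix}^{-1}\begin{bmatrix} GDG'(Q'\hat\beta - c) + c \\ R'\hat\beta \end{bmatrix} \ll \hat\beta$ for $\beta$,    (2.2.25)

which can be simplified as

    $\hat\beta - QGBG'(Q'\hat\beta - c) \ll \hat\beta$ for $\beta$.    (2.2.26)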
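
The simplification leading to (2.2.26) may be spelled out as follows. This is a sketch that assumes, in addition to $Q'Q = I$ and $R'R = I$, that $R$ satisfies $Q'R = 0$ (as we take Section 1.4.2 to require) and that $G$ is square, so that the inverse of the stacked matrix is $(Q, R)$, $QQ' + RR' = I$, and $GG' = I$.

```latex
\begin{aligned}
\begin{bmatrix} Q' \\ R' \end{bmatrix}^{-1}
\begin{bmatrix} GDG'(Q'\hat\beta - c) + c \\ R'\hat\beta \end{bmatrix}
&= Q\,[\,G(I - B)G'(Q'\hat\beta - c) + c\,] + RR'\hat\beta \\
&= QQ'\hat\beta - QGBG'(Q'\hat\beta - c) + RR'\hat\beta \\
&= \hat\beta - QGBG'(Q'\hat\beta - c).
\end{aligned}
```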


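Let us consider a concrete example using generalized ridge estimator 1.
Putting

    $B = \frac{(K - 2)\hat\sigma^2\Lambda}{\hat\alpha'\Lambda^2\hat\alpha}$

(with $G'(Q'\hat\beta - c)$ in the role of $\hat\alpha$) in the left-hand side of (2.2.26), we
obtain the estimator

    $\hat\beta - \frac{(K - 2)\hat\sigma^2}{(Q'\hat\beta - c)'[Q'(X'X)^{-1}Q]^{-2}(Q'\hat\beta - c)}\, Q[Q'(X'X)^{-1}Q]^{-1}(Q'\hat\beta - c).$    (2.2.27)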


Aigner and Judge (1977) applied the estimator (2.2.27) and another estima- 
tor attributed to Bock (1975) to the international trade model of Baldwin 
(1971) and compared these estimates to Baldwin’s estimates, which may be 
regarded as pre-test estimates because Baldwin utilized a certain linear restric- 
tion. Aigner and Judge concluded that the conditions under which Bock’s 
estimator is minimax are not satisfied by the trade data and that although 
Berger’s estimator (2.2.27) is always minimax, it gives results very close to the 
least squares in the trade model. 
2.3 Robust Regression
2.3.1 Introduction


In Chapter 1 we established that the least squares estimator is best linear 
unbiased under Model 1 and best unbiased under Model 1 with normality. 
The theme of the previous section was essentially that a biased estimator may 
be better than the least squares estimator under the normality assumption. 
The theme of this section is that, in the absence of normality, a nonlinear 
estimator may be better than the least squares estimator. We shall first discuss 
robust estimation in the i.i.d. sample case, and in the next subsection we shall
generalize the results to the regression case. 


2.3.2 Independent and Identically Distributed Case 


Let $y_1, y_2, \ldots, y_T$ be independent observations from a symmetric distribution
function $F[(y - \mu)/\sigma]$ such that $F(0) = \frac{1}{2}$. Thus $\mu$ is both the population
mean and the median. Here $\sigma$ represents a scale parameter that may not
necessarily be the standard deviation. Let the order statistics be $y_{(1)} \le
y_{(2)} \le \cdots \le y_{(T)}$. We define the sample median $\tilde\mu$ to be $y_{((T+1)/2)}$ if $T$ is odd
and any arbitrarily determined point between $y_{(T/2)}$ and $y_{((T/2)+1)}$ if $T$ is even. It
has long been known that $\tilde\mu$ would be a better estimator of $\mu$ than the sample
mean $\hat\mu = T^{-1}\sum_{t=1}^T y_t$ if $F$ has heavier tails than the normal distribution.
Intuitively speaking, this is because $\tilde\mu$ is much less sensitive to the effect of a
few wild observations than $\hat\mu$ is.

It can be shown that $\tilde\mu$ is asymptotically normally distributed with mean $\mu$
and variance $[4Tf(0)^2]^{-1}$, where $f$ is the density of $F$.¹⁰ Using this, we can
compare the asymptotic variance of $\tilde\mu$ with that of $\hat\mu$ for various choices of $F$.
Consider the three densities

    Normal     $\frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$
    Laplace    $\frac{1}{2}\, e^{-|x|}$
    Cauchy     $\frac{1}{\pi} \cdot \frac{1}{1 + x^2}$


Table 2.3 Asymptotic variances (times sample size) of the sample mean and the
sample median under selected distributions

               $TV(\hat\mu)$    $TV(\tilde\mu)$
    Normal     1              1.57
    Laplace    2              1
    Cauchy     $\infty$       2.47


Table 2.3 shows the asymptotic variances of $\hat\mu$ and $\tilde\mu$ under these three
distributions. Clearly, the mean is better than the median under normality but the
median outperforms the mean in the case of the other two long-tailed
distributions. (Note that the mean is the maximum likelihood estimator under
normality and that the median, because it minimizes $\sum |y_t - b|$, is the maximum
likelihood estimator under the Laplace distribution.) In general, we call
an estimator, such as the median, that performs relatively well under distribu- 
tions heavier-tailed than normal a “robust” estimator. To be precise, there- 
fore, the word robustness should be used in reference to the particular class of 
possible distributions imagined. A comparison of the variances of the mean 
and the median when the underlying class of distributions is a mixture of two 
normal distributions can be found in Bickel and Doksum (1977, p. 371), 
which has an excellent elementary discussion of robust estimation. 
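
The median column of Table 2.3 follows directly from the formula $[4Tf(0)^2]^{-1}$. The short sketch below recomputes it; the function name `median_avar` and the dictionary layout are our own, and the entries for the mean are simply the variances of the three densities (the Cauchy variance being infinite).

```python
import math


def median_avar(f0):
    """T times the asymptotic variance of the sample median: 1 / (4 f(0)^2)."""
    return 1.0 / (4.0 * f0 ** 2)


# Value of each density at its center of symmetry, x = 0.
density_at_zero = {
    "normal": 1.0 / math.sqrt(2.0 * math.pi),
    "laplace": 0.5,
    "cauchy": 1.0 / math.pi,
}

# T times the variance of the sample mean, i.e., the variance of one observation.
mean_avar = {"normal": 1.0, "laplace": 2.0, "cauchy": math.inf}

table = {
    name: (mean_avar[name], round(median_avar(f0), 2))
    for name, f0 in density_at_zero.items()
}
```
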
Another robust estimator of location that has long been in use is called the
$\alpha$-trimmed mean, which is simply the mean of the sample after the proportion
$\alpha$ of largest and smallest observations have been removed. These and other
similar robust estimators were often used by statisticians in the nineteenth
century (see Huber, 1972; Stigler, 1973). However, the popularity of these
robust estimators declined at the turn of the century, and in the first half of the
present century the sample mean or the least squares estimator became the
dominant estimator. This change occurred probably because many sophisti- 
cated testing procedures have been developed under the normality assump- 
tion (mathematical convenience) and because statisticians have put an undue 
confidence in the central limit theorem (rationalization). In the last twenty 
years we have witnessed a resurgence of interest in robust estimation among 
statisticians who have recognized that the distributions of real data are often 
significantly different from normal and have heavier tails than the normal in 
most cases. Tukey and his associates in Princeton have been the leading 
proponents of this movement. We should also mention Mandelbrot (1963), 
who has gone so far as to maintain that many economic variables have infinite 
variance. However, it should be noted that the usefulness of robust estimation 
is by no means dependent on the unboundedness of the variance; the occurrence
of heavier tails than the normal is sufficient to ensure its efficacy. For a
survey of recent developments, the reader is referred to Andrews et al. (1972), 
who reported on Monte Carlo studies of 68 robust estimators, and to Huber 
(1972, 1977, 1981), Hogg (1974), and Koenker (1981a). 

Robust estimators of location can be classified into four groups: M, $L_p$, L,
and R estimators. M, L, and R estimators are the terms used by Huber (1972).
$L_p$ estimators constitute a subset of the class M, but we have singled them out
because of their particular importance. We shall briefly explain these classes of
estimators and then generalize them to the regression case.


M Estimator. The M estimator (stands for “maximum-likelihood-type”
estimator) is defined as the value of $b$ that minimizes $\sum_{t=1}^T \rho[(y_t - b)/s]$ where $s$
is an estimate of the scale parameter $\sigma$ and $\rho$ is a chosen function. If $\rho$ is twice
differentiable and its second derivative is piecewise continuous with
$E\rho'[(y_t - \mu)/s_0] = 0$ where $s_0$ is the probability limit of $s$,¹¹ we can use the
results of Chapter 4 to show that the M estimator is asymptotically normal
with mean $\mu$ and variance

    $T^{-1}s_0^2\, \frac{E\{\rho'[s_0^{-1}(y_t - \mu)]^2\}}{\{E\rho''[s_0^{-1}(y_t - \mu)]\}^2}.$    (2.3.1)

Note that when $\rho(\lambda) = \lambda^2$, this formula reduces to the familiar formula for
variance of the sample mean.
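
The reduction just noted can be checked in one line. The following worked step is a sketch that takes (2.3.1) to have the form $T^{-1}s_0^2 E\{\rho'[\cdot]^2\}/\{E\rho''[\cdot]\}^2$ with the argument $s_0^{-1}(y_t - \mu)$:

```latex
\rho(\lambda) = \lambda^2 \;\Rightarrow\; \rho'(\lambda) = 2\lambda, \quad \rho''(\lambda) = 2,
\qquad
T^{-1} s_0^2 \,\frac{E\{[2 s_0^{-1}(y_t - \mu)]^2\}}{2^2}
= T^{-1} s_0^2 \,\frac{4 s_0^{-2} E(y_t - \mu)^2}{4}
= \frac{E(y_t - \mu)^2}{T}.
```
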
Consider an M estimator proposed by Huber (1964). It is defined by

    $\rho(z) = \begin{cases} \frac{1}{2}z^2 & \text{if } |z| < c \\ c|z| - \frac{1}{2}c^2 & \text{if } |z| \ge c, \end{cases}$    (2.3.2)

where $z = (y - \mu)/s$ and $c$ is to be chosen by the researcher. (The Monte Carlo
studies of Andrews et al. (1972) considered several values of $c$ between 0.7 and
2.) Huber (1964) arrived at the $\rho$ function in (2.3.2) as the minimax choice
(doing the best against the least favorable distribution) when $F(z) =
(1 - \epsilon)\Phi(z) + \epsilon H(z)$, where $z = (y - \mu)/\sigma$, $H$ varies among all the symmetric
distributions, $\Phi$ is the standard normal distribution function, and $\epsilon$ is a given
constant between 0 and 1. The value of $c$ depends on $\epsilon$ in a certain way. As for
$s$, one may choose any robust estimate of the scale parameter. Huber (1964)
proposed the simultaneous solution of


    $\sum_{t=1}^T \rho'\left(\frac{y_t - b}{s}\right) = 0$    (2.3.3)

and

    $\sum_{t=1}^T \left[\rho'\left(\frac{y_t - b}{s}\right)\right]^2 = \frac{T}{\sqrt{2\pi}} \int_{-\infty}^{\infty} [\rho'(z)]^2 e^{-z^2/2}\, dz$    (2.3.4)

in terms of $b$ and $s$. Huber's estimate of $\mu$ converges to the sample mean or the
sample median as $c$ tends to $\infty$ or 0, respectively.
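
To make the definition concrete, here is a minimal sketch of the M estimate of location for Huber's $\rho$. It is not Huber's simultaneous solution for $b$ and $s$: the scale is held fixed, and when the caller gives none we fall back on the normalized median absolute deviation, which is our assumption. The function name `huber_location` and its defaults are likewise hypothetical.

```python
import statistics


def huber_location(y, c=1.5, s=None, tol=1e-10, max_iter=200):
    """M estimate of location for Huber's rho, with the scale s held fixed.

    Solves sum_t psi((y_t - b)/s) = 0, where psi(z) = max(-c, min(c, z)) is
    the derivative of rho, by iteratively reweighted averaging started at the
    sample median.
    """
    y = list(y)
    b = statistics.median(y)
    if s is None:
        # Assumption: normalized median absolute deviation as the robust scale.
        s = statistics.median([abs(v - b) for v in y]) / 0.6745
    if s == 0.0:
        return b  # degenerate sample: more than half the points coincide
    for _ in range(max_iter):
        weights = []
        for v in y:
            z = abs(v - b) / s
            weights.append(1.0 if z < c else c / z)  # weight psi(z)/z
        b_new = sum(w * v for w, v in zip(weights, y)) / sum(weights)
        if abs(b_new - b) < tol:
            return b_new
        b = b_new
    return b
```

With a very large `c` every weight equals one and the routine returns the sample mean; with a very small `c` it stays near the sample median, in line with the limiting behavior stated above.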

Another M estimator shown to be robust in the study of Andrews et al.
(1972) is the following one proposed by Andrews (1974). Its $\rho$ function is
defined by

    $-\rho(z) = \begin{cases} 1 + \cos z & \text{if } |z| \le \pi \\ 0 & \text{if } |z| > \pi, \end{cases}$    (2.3.5)

where $z = (y - b)/s$ as before. Andrews' choice of $s$ is (2.1) Median $\{|y_t - \tilde\mu|\}$.


$L_p$ Estimator. The $L_p$ estimator is defined as the value of $b$ that minimizes
$\sum_{t=1}^T |y_t - b|^p$. Values of $p$ between 1 and 2 are the ones usually considered.
Clearly, $p = 2$ yields the sample mean and $p = 1$ the sample median. For any
$p \neq 1$, the asymptotic variance of the estimator is given by (2.3.1). The
approximate variance for the case $p = 1$ (the median) was given earlier. Note that
an estimate of a scale parameter need not be used in defining the $L_p$ estimator.
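
Because the minimand is convex in $b$ for $p \ge 1$, a one-dimensional search suffices to compute the $L_p$ estimate. The sketch below uses ternary search over the range of the data; the name `lp_location` and the iteration count are our own choices, and the attainable accuracy is limited by floating-point noise near the minimum.

```python
def lp_location(y, p, iters=200):
    """Value of b minimizing sum_t |y_t - b|**p for p >= 1, by ternary search."""
    if p < 1:
        raise ValueError("the objective is convex in b only for p >= 1")

    def objective(b):
        return sum(abs(v - b) ** p for v in y)

    lo, hi = min(y), max(y)  # the minimizer lies within the range of the data
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)
```
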


L Estimator. The L estimator is defined as a linear combination of order
statistics $y_{(1)} \le y_{(2)} \le \cdots \le y_{(T)}$. The sample median and the $\alpha$-trimmed
mean discussed at the beginning of this subsection are members of this class.
(As we have seen, the median is a member of $L_p$ and hence of M as well.)

Another member of this class is the Winsorized mean, which is similar to a
trimmed mean. Whereas a trimmed mean discards largest and smallest
observations, a Winsorized mean "accumulates" them at each truncation point.
More precisely, it is defined as

    $T^{-1}[(g + 1)y_{(g+1)} + y_{(g+2)} + \cdots + y_{(T-g-1)} + (g + 1)y_{(T-g)}]$    (2.3.6)

for some integer $g$.
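
The contrast between trimming and Winsorizing is easy to see in code. In this sketch both functions take the number $g$ of observations affected in each tail (for the $\alpha$-trimmed mean one might take $g = [T\alpha]$, which is our assumption); the function names are hypothetical.

```python
def trimmed_mean(y, g):
    """Mean of the sample after discarding the g smallest and g largest values."""
    ys = sorted(y)
    T = len(ys)
    if g < 0 or 2 * g >= T:
        raise ValueError("need 0 <= g and 2g < T")
    kept = ys[g:T - g]
    return sum(kept) / len(kept)


def winsorized_mean(y, g):
    """Winsorized mean: the g extreme values in each tail are replaced by the
    nearest retained order statistic, y_(g+1) or y_(T-g), before averaging."""
    ys = sorted(y)
    T = len(ys)
    if g < 0 or 2 * g + 2 > T:
        raise ValueError("need 0 <= g and 2g + 2 <= T")
    total = (g + 1) * ys[g] + sum(ys[g + 1:T - g - 1]) + (g + 1) * ys[T - g - 1]
    return total / T
```
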

A generalization of the median is the $\theta$th sample quantile, $0 < \theta < 1$; this is
defined as $y_{(k)}$, where $k$ is the smallest integer satisfying $k > T\theta$ if $T\theta$ is not an
integer, and an arbitrarily determined point between $y_{(T\theta)}$ and $y_{(T\theta+1)}$ if $T\theta$ is an
integer. Thus $\theta = \frac{1}{2}$ corresponds to the median. It can be shown that the $\theta$th
sample quantile, denoted by $\tilde\mu(\theta)$, minimizes

    $\sum_{y_t \ge b} \theta|y_t - b| + \sum_{y_t < b} (1 - \theta)|y_t - b|.$    (2.3.7)
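
The link between the order-statistic definition and the minimand (2.3.7) can be illustrated numerically. In the sketch below, `sample_quantile` returns $y_{(k)}$ with $k$ the smallest integer exceeding $T\theta$; when $T\theta$ is an integer this picks the upper end point $y_{(T\theta+1)}$, which is one admissible choice. Both function names are ours, and the floating-point product `T * theta` is taken at face value.

```python
import math


def check_loss(y, b, theta):
    """Minimand (2.3.7): weight theta on y_t >= b and (1 - theta) on y_t < b."""
    return sum(theta * (v - b) if v >= b else (1.0 - theta) * (b - v) for v in y)


def sample_quantile(y, theta):
    """theta-th sample quantile y_(k), k the smallest integer with k > T*theta."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    ys = sorted(y)
    k = math.floor(len(ys) * theta) + 1
    return ys[k - 1]
```

Evaluating `check_loss` at every sample point and comparing with its value at `sample_quantile(y, theta)` gives a quick check that no sample point does better.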
Previously we have written $\tilde\mu(\frac{1}{2})$ simply as $\tilde\mu$. Gastwirth (1966) proposed a
linear combination of quantiles $0.3\tilde\mu(\frac{1}{3}) + 0.4\tilde\mu(\frac{1}{2}) + 0.3\tilde\mu(\frac{2}{3})$ as a robust
estimator of location. Its asymptotic distribution can be obtained from the
following general result attributed to Mosteller (1946): Suppose $0 < \theta_1 <
\theta_2 < \cdots < \theta_n < 1$. Then $\tilde\mu(\theta_1), \tilde\mu(\theta_2), \ldots, \tilde\mu(\theta_n)$ are asymptotically
jointly normal with the means equal to their respective population quantiles,
$\mu(\theta_1), \mu(\theta_2), \ldots, \mu(\theta_n)$ (that is, $\theta_i = F[\mu(\theta_i)]$), and variances and
covariances given by

    $\mathrm{Cov}[\tilde\mu(\theta_i), \tilde\mu(\theta_j)] \equiv \omega_{ij} = \frac{\theta_i(1 - \theta_j)}{T f[\mu(\theta_i)]\, f[\mu(\theta_j)]}, \quad i \le j.$    (2.3.8)


R Estimator. The rank is a mapping from $n$ real numbers to the integers 1
through $n$ in such a way that the smallest number is given rank 1 and the next
smallest rank 2 and so on. The rank estimator of $\mu$, denoted by $\mu^*$, is defined as
follows: Construct a sequence of $n = 2T$ observations $x_1, x_2, \ldots, x_n$ by
defining $x_i = y_i - b$, $i = 1, 2, \ldots, T$, and $x_{T+i} = b - y_i$, $i = 1, 2, \ldots, T$,
and let their ranks be $R_1, R_2, \ldots, R_n$. Then $\mu^*$ is the value of $b$ that satisfies

    $\sum_{i=1}^T J\left(\frac{R_i}{2T + 1}\right) = 0,$

where $J$ is a function with the property $\int_0^1 J(\lambda)\, d\lambda = 0$. Hodges and Lehmann
(1963) proposed setting $J(\lambda) = \lambda - \frac{1}{2}$. For this choice of $J(\lambda)$, $\mu^*$ can be shown
to be equal to Median $\{(y_i + y_j)/2\}$, $1 \le i \le j \le T$. It is asymptotically normal
with mean $\mu$ and variance

    $\frac{1}{12T\left[\int f^2(y)\, dy\right]^2}.$    (2.3.9)


Remarks


We have covered most of the major robust estimators of location that have 
been proposed. Of course, we can make numerous variations on these estima- 
tors. Note that in some of the estimation methods discussed earlier there are 
parameters that are left to the discretion of the researcher to determine. One 
systematic way to determine them is the adaptive procedure, in which the 
values of these parameters are determined on the basis of the information 
contained in the sample. Hogg (1974) surveyed many such procedures. For 
example, the $\alpha$ of the $\alpha$-trimmed mean may be chosen so as to minimize an
estimate of the variance of the estimator. Similarly, the weights to be used in
the linear combination of order statistics may be determined using the 
asymptotic variances and covariances of order statistics given in (2.3.8). 

Most of the estimators discussed in this subsection were included among the 
nearly seventy robust estimators considered in the Monte Carlo studies of 
Andrews et al. (1972). Their studies showed that the performance of the 
sample mean is clearly inferior. This finding is, however, contested by Stigler 
(1977), who used real data (eighteenth and nineteenth century observations 
on physical constants such as the speed of light and the mean density of the 
earth for which we now know the true values fairly accurately). He found that 
with his data a slightly trimmed mean did best and the more “drastic” robust 
estimators did poorly. He believes that the conclusions of Andrews et al. are 
biased in favor of the drastic robust estimators because they used distributions 
with significantly heavy tails as the underlying distributions. Andrews et al. 
did not offer definite advice regarding which robust estimator should be used. 
This is inevitable because the performance of an estimator depends on the 
assumed distributions. These observations indicate that it is advisable to per- 
form a preliminary study to narrow the range of distributions that given data 
are supposed to follow and decide on which robust estimator to use, if any. 
Adaptive procedures mentioned earlier will give the researcher an added 
flexibility. 

The exact distributions of these robust estimators are generally hard to 
obtain. However, in many situations they may be well approximated by 
methods such as the jackknife and the bootstrap (see Section 4.3.4). 
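
As an illustration of the last remark, the following sketch computes the Hodges-Lehmann estimate (the median of the pairwise averages, including each observation paired with itself) and a bootstrap standard error for it. The resampling scheme shown is the plain nonparametric bootstrap with a fixed seed; the function names, the number of replications, and the seed are all our own assumptions and are not taken from Section 4.3.4.

```python
import random
import statistics


def hodges_lehmann(y):
    """Median of the pairwise averages (y_i + y_j)/2 for 1 <= i <= j <= T."""
    T = len(y)
    return statistics.median(
        (y[i] + y[j]) / 2.0 for i in range(T) for j in range(i, T)
    )


def bootstrap_se(y, estimator, n_boot=500, seed=0):
    """Standard deviation of the estimator over resamples drawn with replacement."""
    rng = random.Random(seed)
    T = len(y)
    draws = [
        estimator([rng.choice(y) for _ in range(T)]) for _ in range(n_boot)
    ]
    return statistics.stdev(draws)
```

A typical call would be `bootstrap_se(data, hodges_lehmann)`; any other location estimator that accepts a list can be passed in the same way.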


2.3.3 Regression Case 


Let us generalize some of the estimation methods discussed earlier to the 
regression situation. 


M Estimator. The M estimator is easily generalized to the regression
model: It minimizes $\sum_{t=1}^T \rho[(y_t - x_t'b)/s]$ with respect to the vector
$b$. Its asymptotic variance-covariance matrix is given by
$s_0^2(X'AX)^{-1}X'BX(X'AX)^{-1}$, where $A$ and $B$ are diagonal matrices with the $t$th
diagonal elements equal to $E\rho''[(y_t - x_t'\beta)/s_0]$ and $E\{\rho'[(y_t - x_t'\beta)/s_0]^2\}$,
respectively.

Hill and Holland (1977) did a Monte Carlo study on the regression
generalization of Andrews' M estimator described in (2.3.5). They used $s = (2.1)$
Median $\{$largest $T - K + 1$ of $|y_t - x_t'\tilde\beta|\}$ as the scale factor in the $\rho$ function,
where $\tilde\beta$ is the value of $b$ that minimizes $\sum_{t=1}^T |y_t - x_t'b|$. Actually, their
estimator, which they called one-step sine estimator, is defined as

    $\hat\beta_S = (X'DX)^{-1}X'Dy,$    (2.3.10)

where $D$ is the diagonal matrix whose $t$th diagonal element $d_t$ is defined by

    $d_t = \begin{cases} \left[\frac{y_t - x_t'\tilde\beta}{s}\right]^{-1} \sin\left[\frac{y_t - x_t'\tilde\beta}{s}\right] & \text{if } |y_t - x_t'\tilde\beta| \le s\pi \\ 0 & \text{if } |y_t - x_t'\tilde\beta| > s\pi. \end{cases}$    (2.3.11)


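This is approximately the first step of the so-called Newton-Raphson iteration
designed to minimize (2.3.5), as we shall show next.

Put $g(\beta) = \sum_{t=1}^T \rho(z_t)$ where $z_t = (y_t - x_t'\beta)/s$ and expand $g(\beta)$ around $\beta = \tilde\beta$
in a Taylor series as

    $g(\beta) \cong g(\tilde\beta) + \frac{\partial g}{\partial\beta'}(\beta - \tilde\beta) + \frac{1}{2}(\beta - \tilde\beta)'\frac{\partial^2 g}{\partial\beta\,\partial\beta'}(\beta - \tilde\beta),$    (2.3.12)

where the derivatives are evaluated at $\tilde\beta$. Let $\hat\beta_S$ be the value of $\beta$ that minimizes
the right-hand side of (2.3.12). Thus

    $\hat\beta_S = \tilde\beta - \left[\frac{\partial^2 g}{\partial\beta\,\partial\beta'}\right]^{-1}\frac{\partial g}{\partial\beta}.$    (2.3.13)

This is the first step in the Newton-Raphson iteration (see Section 4.4.1).
Inserting

    $\frac{\partial g}{\partial\beta} = -\sum_{t=1}^T s^{-1}\rho' x_t$    (2.3.14)

and

    $\frac{\partial^2 g}{\partial\beta\,\partial\beta'} = \sum_{t=1}^T s^{-2}\rho'' x_t x_t',$    (2.3.15)

where $\rho'$ and $\rho''$ are evaluated at $(y_t - x_t'\tilde\beta)/s$, into (2.3.13), we obtain

    $\hat\beta_S = \left(\sum_{t=1}^T s^{-2}\rho'' x_t x_t'\right)^{-1} \sum_{t=1}^T (s^{-1}\rho' x_t + s^{-2}\rho'' x_t x_t'\tilde\beta).$    (2.3.16)

Finally, inserting a Taylor approximation

    $\rho' \cong \rho'(0) + s^{-1}(y_t - x_t'\tilde\beta)\rho''$    (2.3.17)

into the right-hand side of (2.3.16), we obtain the estimator (2.3.10).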
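
The one-step sine estimator amounts to a single weighted least squares pass with weights $\sin(z_t)/z_t$ computed from a preliminary fit. The sketch below restricts attention to a model with an intercept and one regressor so that it needs only the standard library; the preliminary estimate and the scale are passed in by the caller, and all function names are hypothetical. The helper `hill_holland_scale` implements the scale factor described above as we read it, namely 2.1 times the median of the $T - K + 1$ largest absolute residuals.

```python
import math
import statistics


def wls_line(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * sxx - sx * sx
    b = (sw * sxy - sx * sy) / det
    a = (sy - b * sx) / sw
    return a, b


def hill_holland_scale(resid, K):
    """2.1 times the median of the T - K + 1 largest absolute residuals."""
    abs_r = sorted(abs(r) for r in resid)
    return 2.1 * statistics.median(abs_r[K - 1:])


def sine_weight(r, s):
    """Weight sin(z)/z for |z| <= pi and 0 beyond, where z = r/s."""
    z = r / s
    if abs(z) > math.pi:
        return 0.0
    if z == 0.0:
        return 1.0  # limit of sin(z)/z as z -> 0
    return math.sin(z) / z


def one_step_sine(x, y, prelim, s):
    """One weighted least squares step from the preliminary estimate (a0, b0)."""
    a0, b0 = prelim
    w = [sine_weight(yi - a0 - b0 * xi, s) for xi, yi in zip(x, y)]
    return wls_line(x, y, w)
```

Observations whose scaled residuals exceed $\pi$ in absolute value receive zero weight, so a gross outlier under the preliminary fit is simply ignored in the weighted step.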
In their Monte Carlo study, Hill and Holland assumed that the error term is
$N(0, 1)$ with probability $1 - \alpha$ and $N(0, c^2)$ with probability $\alpha$, with various
values for $c$ and $\alpha$. The regressor matrix was artificially generated with various
degrees of outlying observations and with the number of regressors ranging
from 1 to 6. The sample size was chosen to be 20. The reader should look at the
table in their article (Hill and Holland, 1977) to see the striking improvement
of $\tilde\beta$ or $\hat\beta_S$ over the least squares estimator and the minor improvement of $\hat\beta_S$
over $\tilde\beta$.


$L_p$ Estimator. This class also can easily be generalized to the regression
model. It is the value of $b$ that minimizes $\sum_{t=1}^T |y_t - x_t'b|^p$. A special case in
which $p = 1$, which was already defined as $\tilde\beta$ in the discussion about the M
estimator, will be more fully discussed later as an L estimator. For $p \neq 1$, the
asymptotic variance of the $L_p$ estimator can be obtained by the same formula
as that used for the asymptotic variance of the M estimator. Forsythe (1972)
conducted a Monte Carlo study of $L_p$ estimators for $p = 1.25$, 1.5, 1.75, and 2
(least squares) in a regression model with one fixed regressor and an intercept
where the error term is distributed as $GN(0, 1) + (1 - G)N(S, R)$ for several
values of $G$, $S$, and $R$. His conclusion: The more "contamination," the smaller
$p$ should be.


L Estimator. The $\theta$th sample quantile can be generalized to the regression
situation by simply replacing $b$ by $x_t'b$ in the minimand (2.3.7), as noted by
Koenker and Bassett (1978). We shall call the minimizing value the $\theta$th
sample regression quantile and shall denote it by $\tilde\beta(\theta)$. They investigated
the conditions for the unique solution of the minimization problem and
extended Mosteller's result to the regression case. They established that
$\tilde\beta(\theta_1), \tilde\beta(\theta_2), \ldots, \tilde\beta(\theta_n)$ are asymptotically normal with the means equal to
$\beta + \mu(\theta_1), \beta + \mu(\theta_2), \ldots, \beta + \mu(\theta_n)$, where $\mu(\theta_i) = [\mu(\theta_i), 0, 0, \ldots, 0]'$,
and the variance-covariance matrix is given by

    $\mathrm{Cov}[\tilde\beta(\theta_i), \tilde\beta(\theta_j)] = \omega_{ij}(X'X)^{-1}, \quad i \le j,$    (2.3.18)

where $\omega_{ij}$ is given in (2.3.8). A proof for the special case $\tilde\beta(\frac{1}{2}) = \tilde\beta$ is also given in
Bassett and Koenker (1978) (see Section 4.6.2).

Blattberg and Sargent (1971) conducted a Monte Carlo study and compared
$\tilde\beta$, the least squares estimator, and one other estimator in the model with
one regressor and no intercept, assuming the error term has the characteristic
function $\exp(-|\sigma\lambda|^\alpha)$ for $\alpha = 1.1$, 1.3, 1.5, 1.7, 1.9, and 2.0. Note that $\alpha = 2$
gives the normal distribution and $\alpha = 1$ the Cauchy distribution. They found
that $\tilde\beta$ did best in general.
Schlossmacher (1973) proposed the following iterative scheme to obtain $\tilde\beta$:

    $\tilde\beta_{(i+1)} = (X'DX)^{-1}X'Dy,$    (2.3.19)

where $D$ is the diagonal matrix whose $t$th diagonal element is given by

    $d_t = \begin{cases} |y_t - x_t'\tilde\beta_{(i)}|^{-1} & \text{if } |y_t - x_t'\tilde\beta_{(i)}| > \epsilon \\ 0 & \text{otherwise,} \end{cases}$    (2.3.20)

where $\tilde\beta_{(i)}$ is the estimate obtained at the $i$th iteration and $\epsilon$ is some predefined
small number (say, $\epsilon = 10^{-7}$). Schlossmacher offered no theory of convergence
but found a good convergence in a particular example.

Fair (1974) proposed the following alternative to (2.3.20):

    $d_t = \min(|y_t - x_t'\tilde\beta_{(i)}|^{-1}, \epsilon^{-1}).$    (2.3.21)

Thus Fair bounded the weights from above, whereas Schlossmacher threw out
observations that should be getting the greatest weight possible. It is clear that
Fair's method is preferable because his weights are continuous and nonincreasing
functions of the residuals, whereas Schlossmacher's are not.
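
The iterative scheme can be sketched as follows for a model with an intercept and one regressor, using only the standard library. The two weight functions correspond to the rules of Schlossmacher and Fair as described above; the starting value (ordinary least squares), the stopping rule, and all names are our own assumptions. With Schlossmacher's weights the weighted normal equations can become singular if too many observations are dropped, a case this sketch does not guard against.

```python
def wls_line(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * sxx - sx * sx
    b = (sw * sxy - sx * sy) / det
    a = (sy - b * sx) / sw
    return a, b


def schlossmacher_weight(r, eps):
    """Reciprocal absolute residual, or 0 when the residual is within eps of 0."""
    return 1.0 / abs(r) if abs(r) > eps else 0.0


def fair_weight(r, eps):
    """min(1/|r|, 1/eps): the same weight, bounded from above."""
    return 1.0 / max(abs(r), eps)


def lad_irls(x, y, weight=fair_weight, eps=1e-7, max_iter=500, tol=1e-12):
    """Iteratively reweighted least squares approximation to the LAD line."""
    a, b = wls_line(x, y, [1.0] * len(y))  # start from ordinary least squares
    for _ in range(max_iter):
        w = [weight(yi - a - b * xi, eps) for xi, yi in zip(x, y)]
        a_new, b_new = wls_line(x, y, w)
        converged = max(abs(a_new - a), abs(b_new - b)) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b
```
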

A generalization of the trimmed mean to the regression model has been
proposed by several authors. We shall consider two such methods. The first,
which we shall call $\hat\beta(\alpha)$ for $0 < \alpha < \frac{1}{2}$, requires a preliminary estimate, which
we shall denote by $\hat\beta_0$. The estimation process involves calculating the residuals
from $\hat\beta_0$, throwing away those observations corresponding to the $[T\alpha]$
smallest and $[T\alpha]$ largest residuals, and then calculating $\hat\beta(\alpha)$ by least squares
applied to the remaining observations.

The second method uses the $\theta$th regression quantile $\tilde\beta(\theta)$ of Koenker and
Bassett (1978) mentioned earlier. This method involves removing from the
sample any observations that have a residual from $\tilde\beta(\alpha)$ that is negative or a
residual from $\tilde\beta(1 - \alpha)$ that is positive and then calculating the LS estimator
using the remaining observations. This estimator is denoted by $\hat\beta^*(\alpha)$.

Ruppert and Carroll (1980) derived the asymptotic distribution of the two
estimators and showed that the properties of $\hat\beta(\alpha)$ are sensitive to the choice of
the initial estimate $\hat\beta_0$ and can be inefficient relative to $\hat\beta^*(\alpha)$. However, if
$\hat\beta_0 = \frac{1}{2}[\tilde\beta(\alpha) + \tilde\beta(1 - \alpha)]$ and the distribution of the error term is symmetric,
$\hat\beta(\alpha)$ is asymptotically equivalent to $\hat\beta^*(\alpha)$.


R Estimator. This type of regression estimator was proposed by Jaeckel
(1972). He wrote the regression model as $y = \beta_0 l + X\beta + u$, where $X$ is now
the usual regressor matrix without $l$, the column of ones. Jaeckel's estimator
minimizes $D(y - Xb) = \sum_{t=1}^T [R_t - (T + 1)/2](y_t - x_t'b)$, where $R_t =$
rank$(y_t - x_t'b)$. Note that $R_t$ is also a function of $b$. Jaeckel proved that $D$ is a
nonnegative, continuous, and convex function of $b$ and that his estimator is
asymptotically normal with mean $\beta$ and variance-covariance matrix

    $\tau^2[X'(I - T^{-1}ll')X]^{-1},$    (2.3.22)

where $\tau^2 = 12^{-1}[\int f^2(u)\, du]^{-2}$. The ratio $\sigma^2/\tau^2$ is known as the Pitman
efficiency of the Wilcoxon rank test and is equal to 0.955 if $f$ is normal and
greater than 0.864 for any symmetric distribution, whereas its upper bound is
infinity. Because the derivative of $D$ exists almost everywhere, any iterative
scheme of minimization that uses only the first derivatives can be used.
(Second derivatives are identically zero.) The intercept $\beta_0$ may be estimated by
the Hodges-Lehmann estimator, Median $\{(\hat u_i + \hat u_j)/2\}$, $1 \le i \le j \le T$, where $\hat u$
is the vector of the least squares residuals. See the articles by McKean and
Hettmansperger (1976) and Hettmansperger and McKean (1977) for tests of
linear hypotheses using Jaeckel's estimator.
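
Jaeckel's dispersion function is simple to evaluate, and in the case of a single regressor its convexity makes a one-dimensional search sufficient. The sketch below is ours: `jaeckel_dispersion` computes $D$ for a given residual vector, and `jaeckel_slope` minimizes it over a caller-supplied bracket by ternary search, a derivative-free stand-in for the first-derivative schemes mentioned above. Ties in the residuals are ranked arbitrarily, which does not affect the value of $D$.

```python
def jaeckel_dispersion(resid):
    """D = sum_t [R_t - (T + 1)/2] * e_t, with R_t the rank of e_t."""
    T = len(resid)
    order = sorted(range(T), key=lambda t: resid[t])
    centered_rank = [0.0] * T
    for rank, t in enumerate(order, start=1):
        centered_rank[t] = rank - (T + 1) / 2.0
    return sum(cr * e for cr, e in zip(centered_rank, resid))


def jaeckel_slope(x, y, lo, hi, iters=200):
    """Slope b minimizing D(y - x*b) over [lo, hi] for a single regressor."""

    def objective(b):
        return jaeckel_dispersion([yi - b * xi for xi, yi in zip(x, y)])

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)
```

Because the centered ranks sum to zero, adding a constant to every residual leaves $D$ unchanged, which is why the intercept must be estimated separately.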


Exercises 


1. (Section 2.1.3) 
Show that the Bayesian minimand (2.1.5) is minimized when $S$ is chosen to
be the set of y that satisfies the inequality (2.1.3). 


2. (Section 2.1.5) 
A weakness of PC is that it does not choose the right model with probability
1 when $T$ goes to infinity. (The weakness, however, is not serious.) Suppose
we must choose between regressor matrix $X_1$ and $X$ such that $X_1 \subset X$. Show
that


lim P[PC chooses Xj|X, is true] = P[yz-x, < 2K] € 1. 


3. (Section 2.1.5) 
Schwarz's (1978) criterion minimizes $T \log y'M_i y + K_i \log T$. Show that
this criterion chooses the correct model with probability 1 as $T$ goes to $\infty$.


4. (Section 2.2.3)
If $F'\beta$ is estimable, show that $F'\hat\beta$ is the BLUE of $F'\beta$, where $\hat\beta$ is the LS
estimator.
5. (Section 2.2.3)
Show that fl; defined in (2.2.10) is a solution of (2.2.6). 


6. (Section 2.2.4) 
Show that for any square matrix $A$ there exists a positive constant $\gamma_0$ such
that for all $\gamma > \gamma_0$, $A + \gamma I$ is nonsingular.


3 Large Sample Theory 


Large sample theory plays a major role in the theory of econometrics because 
econometricians must frequently deal with more complicated models than 
the classical linear regression model of Chapter 1. Few finite sample results are
known for these models, and, therefore, statistical inference must be based on 
the large sample properties of the estimators. In this chapter we shall present a 
brief review of random variables and the distribution function, discuss various 
convergence theorems including laws of large numbers and central limit 
theorems, and then use these theorems to prove the consistency and the 
asymptotic normality of the least squares estimator. Additional examples of 
the application of the convergence theorems will be given. 


3.1 A Review of Random Variables and the Distribution Function 


This section is not meant to be a complete discussion of the subject; the reader
is assumed to know the fundamentals of the theory of probability, random
variables, and distribution functions at the level of an intermediate textbook
in mathematical statistics.¹ Here we shall introduce a few concepts that are not
usually dealt with in intermediate textbooks but are required in the subsequent
analysis, in particular, the rigorous definition of a random variable and
the definition of the Stieltjes integral.²


3.1.1 Random Variables 


At the level of an intermediate textbook, a random variable is defined as a 
real-valued function over a sample space. But a sample space is not defined 
precisely, and once a random variable is defined the definition is quickly 
forgotten and a random variable becomes identified with its probability dis- 
tribution. This treatment is perfectly satisfactory for most practical applica- 
tions, but certain advanced theorems can be proved more easily by using the 
fact that a random variable is indeed a function. 

We shall first define a sample space and a probability space. In concrete
terms, a sample space may be regarded as the set of all the possible outcomes of
an experiment. Thus, in the experiment of throwing a die, the six faces of the
die constitute the sample space; and in the experiment of measuring the height
of a randomly chosen student, the set of positive real numbers can be chosen
as the sample space. As in the first example, a sample space may be a set of
objects other than numbers. A subset of a sample space may be called an event.
Thus we speak of the event of an ace turning up or the event of an even number
showing in the throw of a die. With each event we associate a real number
between 0 and 1 called the probability of the event. When we think of a sample
space, we often think of the other two concepts as well: the collection of its
subsets (events) and the probabilities attached to the events. The term probability
space refers to all three concepts collectively. We shall develop an abstract
definition of a probability space in that collective sense.

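Given an abstract sample space $\Omega$, we want to define the collection $\mathcal{A}$ of
subsets of $\Omega$ that possess certain desired properties.


DEFINITION 3.1.1. The collection $\mathcal{A}$ of subsets of $\Omega$ is called a $\sigma$-algebra if it
satisfies the properties:
(i) $\Omega \in \mathcal{A}$.
(ii) $E \in \mathcal{A} \Rightarrow \bar E \in \mathcal{A}$. ($\bar E$ refers to the complement of $E$ with respect
to $\Omega$.)
(iii) $E_j \in \mathcal{A}$, $j = 1, 2, \ldots \Rightarrow \bigcup_{j=1}^{\infty} E_j \in \mathcal{A}$.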


Given a $\sigma$-algebra, we shall define over it a real-valued set function
satisfying certain properties.


DEFINITION 3.1.2. A probability measure, denoted by $P(\cdot)$, is a real-valued
set function that is defined over a $\sigma$-algebra $\mathcal{A}$ and satisfies the properties:
(i) $E \in \mathcal{A} \Rightarrow P(E) \ge 0$.
(ii) $P(\Omega) = 1$.
(iii) If $\{E_j\}$ is a countable collection of disjoint sets in $\mathcal{A}$, then

    $P\left(\bigcup_j E_j\right) = \sum_j P(E_j).$

A probability space and a random variable are defined as follows:

DEFINITION 3.1.3. Given a sample space $\Omega$, a $\sigma$-algebra $\mathcal{A}$ associated with
$\Omega$, and a probability measure $P(\cdot)$ defined over $\mathcal{A}$, we call the triplet
$(\Omega, \mathcal{A}, P)$ a probability space.³

DEFINITION 3.1.4. A random variable on $(\Omega, \mathcal{A}, P)$ is a real-valued function⁴
defined over a sample space $\Omega$, denoted by $X(\omega)$ for $\omega \in \Omega$, such that for
any real number $x$,

    $\{\omega \mid X(\omega) < x\} \in \mathcal{A}.$


Let us consider two examples of probability space and random variables 
defined over them. 


EXAMPLE 3.1.1. In the sample space consisting of the six faces of a die, all the
possible subsets (including the whole space and the null set) constitute a
$\sigma$-algebra. A probability measure can be defined, for example, by assigning 1/6
to each face and extending probabilities to the other subsets according to the
rules given by Definition 3.1.2. An example of a random variable defined over
this space is a mapping of the even-numbered faces to one and the odd-numbered
faces to zero.
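
Example 3.1.1 is small enough to be checked exhaustively. The sketch below builds the collection of all subsets of the six faces, verifies the closure properties of Definition 3.1.1 (on a finite space, closure under pairwise unions is all that countable unions require), and confirms that the even/odd indicator satisfies the measurability condition of Definition 3.1.4, which we state here with a strict inequality. All names are our own.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset(range(1, 7))  # the six faces of the die

# All 2^6 subsets of OMEGA, including the null set and OMEGA itself.
SIGMA_ALGEBRA = {
    frozenset(c)
    for r in range(len(OMEGA) + 1)
    for c in combinations(sorted(OMEGA), r)
}


def P(event):
    """Probability measure assigning 1/6 to each face."""
    return Fraction(len(event), len(OMEGA))


def X(face):
    """Random variable mapping even faces to one and odd faces to zero."""
    return 1 if face % 2 == 0 else 0


def is_sigma_algebra(collection, omega):
    """Check the three defining properties on a finite sample space."""
    if omega not in collection:
        return False
    if any(omega - e not in collection for e in collection):
        return False
    return all(e1 | e2 in collection for e1 in collection for e2 in collection)


def is_random_variable(func, collection, omega, thresholds):
    """Check that {w : func(w) < x} belongs to the collection for each x."""
    return all(
        frozenset(w for w in omega if func(w) < x) in collection
        for x in thresholds
    )
```

The same indicator fails to be a random variable with respect to the trivial $\sigma$-algebra consisting only of the null set and the whole space, since the set of odd faces does not belong to it.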


EXAMPLE 3.1.2. Let a sample space be the closed interval [0, 1]. Consider
the smallest $\sigma$-algebra containing all the open sets in the interval. Such a
$\sigma$-algebra is called the collection of Borel sets or a Borel field. This $\sigma$-algebra
can be shown to contain all the countable unions and intersections of open
and closed sets. A probability measure of a Borel set can be defined, for
example, by assigning to every interval (open, closed, or half-open and
half-closed) its length and extending the probabilities to the other Borel sets
according to the rules set forth in Definition 3.1.2. Such a measure is called
Lebesgue measure.⁵ In Figure 3.1 three random variables, $X$, $Y$, and $Z$, each of
which takes the value 1 or 0 with probability 1/2, are depicted over this
probability space. Note that $Z$ is independent of either $X$ or $Y$, whereas $X$ and $Y$ are not
independent (in fact $XY = 0$). A continuous random variable $X(\omega)$ with the
standard normal distribution can be defined over the same probability space
by $X = \Phi^{-1}(\omega)$, where $\Phi$ is the standard normal distribution function and
the superscript $-1$ denotes the inverse function.

[Figure 3.1: three panels showing $X(\omega)$, $Y(\omega)$, and $Z(\omega)$ over [0, 1]]
Figure 3.1 Discrete random variables defined over [0, 1] with Lebesgue measure
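The construction $X = \Phi^{-1}(\omega)$ can be tried numerically. The following is a minimal sketch, assuming Python's standard library: `statistics.NormalDist().inv_cdf` plays the role of $\Phi^{-1}$, and pseudo-random draws from `random.random()` stand in for points $\omega$ of [0, 1] under Lebesgue measure. The function names, the sample size, and the seed are illustrative choices, not part of the text.

```python
import random
from statistics import NormalDist, fmean, pvariance

def normal_from_uniform(omega):
    """Map a point omega in (0, 1) to X(omega) = Phi^{-1}(omega)."""
    return NormalDist().inv_cdf(omega)

def sample_normal(n, seed=0):
    """Draw n values of X(omega) with omega pseudo-uniform on (0, 1)."""
    rng = random.Random(seed)
    draws = []
    while len(draws) < n:
        omega = rng.random()          # uniform on [0, 1)
        if omega > 0.0:               # inv_cdf is not defined at the end point 0
            draws.append(normal_from_uniform(omega))
    return draws

xs = sample_normal(20000)
sample_mean = fmean(xs)               # should be close to 0
sample_var = pvariance(xs)            # should be close to 1
```

With a large number of draws the sample mean and variance should be close to 0 and 1, as the standard normal distribution requires.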


3.1.2 Distribution Function 


DEFINITION 3.1.5. The distribution function $F(x)$ of a random variable $X(\omega)$
is defined by

$$F(x) = P\{\omega \mid X(\omega) < x\}.$$

Note that the distribution function can be defined for any random variable
because a probability is assigned to every element of $A$ and hence to
$\{\omega \mid X(\omega) < x\}$ for any $x$. We shall write $P\{\omega \mid X(\omega) < x\}$ more compactly as
$P(X < x)$.

A distribution function has the properties: 

(i) $F(-\infty) = 0$.
(ii) $F(\infty) = 1$.
(iii) It is nondecreasing and continuous from the left.

[Some authors define the distribution function as $F(x) = P\{\omega \mid X(\omega) \le x\}$.
Then it is continuous from the right.]

Using a distribution function, we can define the expected value of a random 
variable whether it is discrete, continuous, or a mixture of the two. This is 
done by means of the Riemann-Stieltjes integral, which is a simple generaliza- 
tion of the familiar Riemann integral. Let X be a random variable with a 
distribution function $F$ and let $Y = h(X)$, where $h(\cdot)$ is Borel-measurable.⁶
We define the expected value of $Y$, denoted by $EY$, as follows. Divide an
interval $[a, b]$ into $n$ intervals with the end points $a = x_0 < x_1 < \dots < x_{n-1} < x_n = b$
and let $x_j^*$ be an arbitrary point in $[x_j, x_{j+1}]$. Define the partial
sum

$$S_n = \sum_{j=0}^{n-1} h(x_j^*)\,[F(x_{j+1}) - F(x_j)]$$  (3.1.1)

associated with this partition of the interval $[a, b]$. If, for any $\epsilon > 0$, there exists
a real number $A$ and a partition such that for every finer partition and for any
choice of $x_j^*$, $|S_n - A| < \epsilon$, we call $A$ the Riemann-Stieltjes integral and denote
it by $\int_a^b h(x)\,dF(x)$. It exists if $h$ is a continuous function except possibly for a
countable number of discontinuities, provided that, whenever its
discontinuity coincides with that of $F$, it is continuous from the right.⁷ Finally, we define

$$EY = \int_{-\infty}^{\infty} h(x)\,dF(x) = \lim_{a \to -\infty,\; b \to \infty} \int_a^b h(x)\,dF(x),$$  (3.1.2)

provided the limit (which may be $+\infty$ or $-\infty$) exists regardless of the way
$a \to -\infty$ and $b \to \infty$.

If $dF/dx$ exists and is equal to $f(x)$, $F(x_{j+1}) - F(x_j) = f(x_j^*)(x_{j+1} - x_j)$ for
some $x_j^* \in [x_j, x_{j+1}]$ by the mean value theorem. Therefore

$$Eh(X) = \int_{-\infty}^{\infty} h(x) f(x)\,dx.$$  (3.1.3)

On the other hand, suppose $X = c_i$ with probability $p_i$, $i = 1, 2, \dots, K$.
Take $a < c_1$ and $c_K < b$; then, for sufficiently large $n$, each interval contains at
most one of the $c_i$'s. Then, of the $n$ terms in the summand of (3.1.1), only $K$
terms containing $c_i$'s are nonzero. Therefore

$$\int_a^b h(x)\,dF(x) = \sum_{i=1}^{K} h(c_i)\,p_i.$$  (3.1.4)
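The discrete case can be checked numerically. The sketch below, under illustrative assumptions (a three-point distribution, $h(x) = x^2$, an equal partition of $[-1, 3]$, and $x_j^*$ taken as the midpoint of each interval), evaluates the partial sum (3.1.1) with the left-continuous $F(x) = P(X < x)$ and compares it with the right-hand side of (3.1.4). The function names are hypothetical.

```python
def make_left_cdf(points, probs):
    """Return F(x) = P(X < x) for X taking points[i] with probability probs[i]."""
    def F(x):
        return sum(p for c, p in zip(points, probs) if c < x)
    return F

def stieltjes_partial_sum(h, F, a, b, n):
    """Partial sum (3.1.1) on an equal partition of [a, b]; x_j* is the midpoint."""
    width = (b - a) / n
    total = 0.0
    for j in range(n):
        left = a + j * width
        right = a + (j + 1) * width
        total += h(0.5 * (left + right)) * (F(right) - F(left))
    return total

def h(x):
    return x * x

points = [0.0, 1.0, 2.0]          # the c_i of the text (illustrative values)
probs = [0.2, 0.5, 0.3]           # the p_i of the text (illustrative values)
F = make_left_cdf(points, probs)

approx = stieltjes_partial_sum(h, F, -1.0, 3.0, 10000)
exact = sum(h(c) * p for c, p in zip(points, probs))   # right-hand side of (3.1.4)
```

Only the intervals that contain a mass point contribute to the sum, so `approx` approaches `exact` as the partition is refined.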


3.2 Various Modes of Convergence 


In this section, we shall define four modes of convergence for a sequence of 
random variables and shall state relationships among them in the form of 
several theorems. 


DEFINITION 3.2.1 (convergence in probability). A sequence of random
variables $\{X_n\}$ is said to converge to a random variable $X$ in probability if
$\lim_{n\to\infty} P(|X_n - X| > \epsilon) = 0$ for any $\epsilon > 0$. We write $X_n \xrightarrow{P} X$ or plim $X_n = X$.


DEFINITION 3.2.2 (convergence in mean square). A sequence $\{X_n\}$ is said to
converge to $X$ in mean square if $\lim_{n\to\infty} E(X_n - X)^2 = 0$. We write $X_n \xrightarrow{M} X$.


DEFINITION 3.2.3 (convergence in distribution). A sequence $\{X_n\}$ is said to
converge to $X$ in distribution if the distribution function $F_n$ of $X_n$ converges to
the distribution function $F$ of $X$ at every continuity point of $F$. We write
$X_n \xrightarrow{d} X$, and we call $F$ the limit distribution of $\{X_n\}$. If $\{X_n\}$ and $\{Y_n\}$ have the
same limit distribution, we write $X_n \overset{LD}{=} Y_n$.


The reason for adding the phrase “at every continuity point of $F$” can be
understood by considering the following example: Consider the sequence
$F_n(\cdot)$ such that

$$F_n(x) = 0, \qquad x < \alpha - \tfrac{1}{n}$$  (3.2.1)
$$\phantom{F_n(x)} = \tfrac{n}{2}\left(x - \alpha + \tfrac{1}{n}\right), \qquad \alpha - \tfrac{1}{n} \le x \le \alpha + \tfrac{1}{n}$$
$$\phantom{F_n(x)} = 1, \qquad \alpha + \tfrac{1}{n} < x.$$

Then $\lim F_n$ is not continuous from the left at $\alpha$ and therefore is not a
distribution function. However, we would like to say that the random variable
with the distribution (3.2.1) converges in distribution to a degenerate random
variable which takes the value $\alpha$ with probability one. The phrase “at every
continuity point of $F$” enables us to do so.
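The role of the continuity points can be seen by tabulating $F_n$ for growing $n$. The sketch below assumes that $F_n$ rises linearly from 0 to 1 on $[\alpha - 1/n, \alpha + 1/n]$ and takes $\alpha = 0$ for illustration; both the function name and the evaluation points are hypothetical choices.

```python
def F_n(x, n, alpha=0.0):
    """Distribution function that rises linearly on [alpha - 1/n, alpha + 1/n]."""
    lo, hi = alpha - 1.0 / n, alpha + 1.0 / n
    if x < lo:
        return 0.0
    if x > hi:
        return 1.0
    return 0.5 * n * (x - lo)

sizes = (10, 100, 1000)
values_left = [F_n(-0.01, n) for n in sizes]    # a point below alpha
values_at = [F_n(0.0, n) for n in sizes]        # the point alpha itself
values_right = [F_n(0.01, n) for n in sizes]    # a point above alpha
```

At points away from $\alpha$ the values settle at 0 or 1, the values of the degenerate distribution function, whereas at $\alpha$ itself they stay at 1/2 for every $n$. Since $\alpha$ is the only discontinuity point of the degenerate limit, it is excluded from the requirement.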


DEFINITION 3.2.4 (almost sure convergence). A sequence $\{X_n\}$ is said to
converge to $X$ almost surely⁸ if

$$P\{\omega \mid \lim_{n\to\infty} X_n(\omega) = X(\omega)\} = 1.$$

We write $X_n \xrightarrow{a.s.} X$.


The next four theorems establish the logical relationships among the four
modes of convergence, depicted in Figure 3.2.⁹

[Figure 3.2: M → P → d, and a.s. → P]
Figure 3.2 Logical relationships among four modes of convergence


THEOREM 3.2.1 (Chebyshev). $EX_n^2 \to 0 \Rightarrow X_n \xrightarrow{P} 0$.
Proof. We have

$$EX_n^2 = \int_{-\infty}^{\infty} x^2\,dF_n(x) \ge \epsilon^2 \int_S dF_n(x),$$  (3.2.2)

where $S = \{x \mid x^2 \ge \epsilon^2\}$. But we have

$$\int_S dF_n(x) = \int_{-\infty}^{-\epsilon} dF_n(x) + \int_{\epsilon}^{\infty} dF_n(x)$$  (3.2.3)
$$= F_n(-\epsilon) + [1 - F_n(\epsilon)]$$
$$= P(X_n < -\epsilon) + P(X_n \ge \epsilon)$$
$$\ge P[X_n^2 > \epsilon^2].$$

Therefore, from (3.2.2) and (3.2.3), we obtain

$$P[X_n^2 > \epsilon^2] \le \frac{EX_n^2}{\epsilon^2}.$$  (3.2.4)

The theorem immediately follows from (3.2.4).


The inequality (3.2.4) is called Chebyshev's inequality. By slightly
modifying the proof, we can establish the following generalized form of Chebyshev's
inequality:

$$P[g(X_n) > \epsilon^2] \le \frac{Eg(X_n)}{\epsilon^2},$$  (3.2.5)

where $g(\cdot)$ is any nonnegative continuous function.

Note that the statement $X_n \xrightarrow{M} X \Rightarrow X_n \xrightarrow{P} X$, where $X$ may be either a
constant or a random variable, follows from Theorem 3.2.1 if we regard $X_n - X$ as
the $X_n$ of the theorem.
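Chebyshev's inequality (3.2.4) can be verified exactly for a small discrete distribution. The sketch below is an illustration under assumed values (a symmetric five-point distribution and three choices of $\epsilon$); it uses exact rational arithmetic from Python's `fractions` module, and the function name is hypothetical.

```python
from fractions import Fraction

def chebyshev_sides(values, probs, eps):
    """Return (P[X^2 > eps^2], E X^2 / eps^2) for a discrete random variable."""
    second_moment = sum(p * v * v for v, p in zip(values, probs))
    tail = sum(p for v, p in zip(values, probs) if v * v > eps * eps)
    return tail, second_moment / (eps * eps)

values = [-2, -1, 0, 1, 2]
probs = [Fraction(1, 10), Fraction(2, 10), Fraction(4, 10),
         Fraction(2, 10), Fraction(1, 10)]

# Left-hand and right-hand sides of (3.2.4) for several eps.
results = {eps: chebyshev_sides(values, probs, Fraction(eps))
           for eps in (1, Fraction(3, 2), 3)}
```

For each $\epsilon$ the first entry (the tail probability) does not exceed the second (the bound), and the bound shrinks as $\epsilon$ grows.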

We shall state the next two theorems without proof. The proof of Theorem 
3.2.2 can be found in Mann and Wald (1943) or Rao (1973, p. 122). The proof 
of Theorem 3.2.3 is left as an exercise. 


THEOREM 3.2.2. $X_n \xrightarrow{P} X \Rightarrow X_n \xrightarrow{d} X$.
THEOREM 3.2.3. $X_n \xrightarrow{a.s.} X \Rightarrow X_n \xrightarrow{P} X$.


The converse of Theorem 3.2.2 is not generally true, but it holds in the
special case where $X$ is equal to a constant $\alpha$. We shall state it as a theorem, the
proof of which is simple and left as an exercise.


THEOREM 3.2.4. $X_n \xrightarrow{d} \alpha \Rightarrow X_n \xrightarrow{P} \alpha$.


The converse of Theorem 3.2.3 does not hold either, as we shall show by a
well-known example. Define a probability space $(\Omega, A, P)$ as follows: $\Omega =
[0, 1]$, $A$ = Lebesgue-measurable sets in [0, 1], and $P$ = Lebesgue measure as
in Example 3.1.2. Define a sequence of random variables $X_n(\omega)$ as

$X_1(\omega) = 1$ for $0 \le \omega \le 1$

$X_2(\omega) = 1$ for $0 \le \omega \le \tfrac{1}{2}$
$\phantom{X_2(\omega)} = 0$ elsewhere

$X_3(\omega) = 1$ for $\tfrac{1}{2} \le \omega \le \tfrac{1}{2} + \tfrac{1}{3}$
$\phantom{X_3(\omega)} = 0$ elsewhere

$X_4(\omega) = 1$ for $0 \le \omega \le \tfrac{1}{12}$ and $\tfrac{1}{2} + \tfrac{1}{3} \le \omega \le 1$
$\phantom{X_4(\omega)} = 0$ elsewhere

. . .

In other words, the subset of $\Omega$ over which $X_n$ assumes unity has the total
length $1/n$ and keeps moving to the right until it reaches the right end point of
[0, 1], at which point it moves back to 0 and starts again. For any $1 > \epsilon > 0$, we
clearly have

$$P(|X_n| > \epsilon) = \frac{1}{n}$$

and therefore $X_n \xrightarrow{P} 0$. However, because $\sum_{i=1}^{\infty} i^{-1} = \infty$, there is no element in
$\Omega$ for which $\lim_{n\to\infty} X_n(\omega) = 0$. Therefore $P\{\omega \mid \lim_{n\to\infty} X_n(\omega) = 0\} = 0$,
implying that $X_n$ does not converge to 0 almost surely.

The next three convergence theorems are extremely useful in obtaining the 
asymptotic properties of estimators. 


THEOREM 3.2.5 (Mann and Wald). Let $X_n$ and $X$ be $K$-vectors of random
variables and let $g(\cdot)$ be a function from $R^K$ to $R$ such that the set $E$ of
discontinuity points of $g(\cdot)$ is closed and $P(X \in E) = 0$. If $X_n \xrightarrow{d} X$, then
$g(X_n) \xrightarrow{d} g(X)$.


A slightly more general theorem, in which a continuous function is replaced
by a Borel measurable function, was proved by Mann and Wald (1943). The
convergence in distribution of the individual elements of the vector $X_n$ to the
corresponding elements of the vector $X$ is not sufficient for obtaining the
above results. However, if the elements of $X_n$ are independent for every $n$, the
separate convergence is sufficient.


THEOREM 3.2.6. Let $X_n$ be a vector of random variables with a fixed finite
number of elements. Let $g$ be a real-valued function continuous at a constant
vector point $\alpha$. Then $X_n \xrightarrow{P} \alpha \Rightarrow g(X_n) \xrightarrow{P} g(\alpha)$.


Proof. Continuity at $\alpha$ means that for any $\epsilon > 0$ we can find $\delta$ such that
$\|X_n - \alpha\| < \delta$ implies $|g(X_n) - g(\alpha)| < \epsilon$. Therefore

$$P[\|X_n - \alpha\| < \delta] \le P[|g(X_n) - g(\alpha)| < \epsilon].$$  (3.2.6)

The theorem follows because the left-hand side of (3.2.6) converges to 1 by the
assumption of the theorem.


THEOREM 3.2.7 (Slutsky). If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} \alpha$, then
(i) $X_n + Y_n \xrightarrow{d} X + \alpha$,
(ii) $X_n Y_n \xrightarrow{d} \alpha X$,
(iii) $(X_n / Y_n) \xrightarrow{d} X/\alpha$, provided $\alpha \ne 0$.


The proof has been given by Rao (1973, p. 122). By repeated applications of
Theorem 3.2.7, we can prove the more general theorem that if $g$ is a rational
function and plim $Y_{in} = \alpha_i$, $i = 1, 2, \dots, J$, and $X_{in} \xrightarrow{d} X_i$ jointly in all
$i = 1, 2, \dots, K$, then the limit distribution of $g(X_{1n}, X_{2n}, \dots, X_{Kn}, Y_{1n}, Y_{2n},
\dots, Y_{Jn})$ is the same as the distribution of $g(X_1, X_2, \dots, X_K, \alpha_1, \alpha_2,
\dots, \alpha_J)$. This is perhaps the single most useful theorem in large sample
theory.
The following definition concerning the stochastic order relationship is 
useful (see Mann and Wald, 1943, for more details). 


DEFINITION 3.2.5. Let $\{X_n\}$ be a sequence of random variables and let
$\{a_n\}$ be a sequence of positive constants. Then we can write $X_n = o(a_n)$ if
$\text{plim}_{n\to\infty}\, a_n^{-1} X_n = 0$ and $X_n = O(a_n)$ if for any $\epsilon > 0$ there exists an $M_\epsilon$ such
that

$$P[a_n^{-1}|X_n| \le M_\epsilon] \ge 1 - \epsilon$$

for all values of $n$.


Sometimes these order relationships are denoted $o_p$ and $O_p$ respectively to
distinguish them from the cases where $\{X_n\}$ are nonstochastic. However, we
use the same symbols for both stochastic and nonstochastic cases because
Definition 3.2.5 applies to the nonstochastic case as well in a trivial way.


3.3 Laws of Large Numbers and Central Limit Theorems 


Given a sequence of random variables $\{X_t\}$, define $\bar{X}_n = n^{-1} \sum_{t=1}^{n} X_t$. A law of
large numbers (LLN) specifies the conditions under which $\bar{X}_n - E\bar{X}_n$
converges to 0 either almost surely or in probability. The former is called a strong
law of large numbers and the latter a weak law. In many applications the
simplest way to show $\bar{X}_n - E\bar{X}_n \xrightarrow{P} 0$ is to show $\bar{X}_n - E\bar{X}_n \xrightarrow{M} 0$ and then
apply Theorem 3.2.1 (Chebyshev). We shall state two strong laws of large
numbers. For proofs, the reader is referred to Rao (1973, pp. 114-115).


THEOREM 3.3.1 (Kolmogorov LLN 1). Let $\{X_t\}$ be independent with finite
variance $VX_t = \sigma_t^2$. If $\sum_{t=1}^{\infty} \sigma_t^2/t^2 < \infty$, then $\bar{X}_n - E\bar{X}_n \xrightarrow{a.s.} 0$.


THEOREM 3.3.2 (Kolmogorov LLN 2). Let $\{X_t\}$ be i.i.d. (independent and
identically distributed). Then a necessary and sufficient condition that $\bar{X}_n \xrightarrow{a.s.} \mu$
is that $EX_t$ exists and is equal to $\mu$.


Theorem 3.3.2 and the result obtained by putting $\bar{X}_n - E\bar{X}_n$ into the $X_n$ of
Theorem 3.2.1 are complementary. If we were to use Theorem 3.2.1 to prove
$\bar{X}_n - E\bar{X}_n \xrightarrow{P} 0$, we would need the finite variance of $X_t$, which Theorem 3.3.2
does not require; but Theorem 3.3.2 requires $\{X_t\}$ to be i.i.d., which Theorem
3.2.1 does not.¹⁰

When all we want is the proof of $\bar{X}_n - E\bar{X}_n \xrightarrow{P} 0$, Theorem 3.3.1 does not
add anything to Theorem 3.2.1 because the assumptions of Theorem 3.3.1
imply $\bar{X}_n - E\bar{X}_n \xrightarrow{M} 0$. The proof is left as an exercise.

Now we ask the question, What is an approximate distribution of $\bar{X}_n$ when $n$
is large? Suppose a law of large numbers holds for a sequence $\{X_t\}$ so that
$\bar{X}_n - E\bar{X}_n \xrightarrow{P} 0$. It follows from Theorem 3.2.2 that $\bar{X}_n - E\bar{X}_n \xrightarrow{d} 0$.
However, it is an uninteresting limit distribution because it is degenerate. It is
more meaningful to inquire into the limit distribution of $Z_n = (V\bar{X}_n)^{-1/2}
(\bar{X}_n - E\bar{X}_n)$. For, if the limit distribution of $Z_n$ exists, it should be
nondegenerate because $VZ_n = 1$ for all $n$. A central limit theorem (CLT) specifies the
conditions under which $Z_n$ converges in distribution to a standard normal
random variable [we shall write $Z_n \to N(0, 1)$].

We shall state three central limit theorems — Lindeberg-Lévy, Liapounov, 
and Lindeberg-Feller — and shall prove only the first. For proofs of the other 
two, see Chung (1974) or Gnedenko and Kolmogorov (1954). Lindeberg-Lévy
and Liapounov are special cases of Lindeberg-Feller, in the sense that the
assumptions of either one of the first two central limit theorems imply those of 
Lindeberg-Feller. The assumptions of Lindeberg-Lévy are more restrictive in 
some respects and less restrictive in other respects than those of Liapounov. 
Before we state the central limit theorems, however, we shall define the 
characteristic function of a random variable and study its properties. 


DEFINITION 3.3.1. The characteristic function of a random variable $X$ is
defined by $Ee^{i\lambda X}$.


Thus if the distribution function of $X$ is $F(\cdot)$, the characteristic function is
$\int_{-\infty}^{\infty} e^{i\lambda x}\,dF(x)$. The characteristic function is generally a complex number.
However, because $e^{i\lambda x} = \cos \lambda x + i \sin \lambda x$, the characteristic function of a
random variable with a density function symmetric around 0 is real. The
characteristic function of $N(0, 1)$ can be evaluated as

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(i\lambda x - \frac{x^2}{2}\right) dx = \exp\left(-\frac{\lambda^2}{2}\right).$$  (3.3.1)

Define $g(\lambda) = \text{Log} \int_{-\infty}^{\infty} e^{i\lambda x}\,dF(x)$, where Log denotes the principal
logarithm.¹¹ Then the following Taylor expansion is valid provided $EX^r$ exists:

$$g(\lambda) = \sum_{j=1}^{r} \kappa_j \frac{(i\lambda)^j}{j!} + o(|\lambda|^r),$$  (3.3.2)

where $\kappa_j = (\partial^j g/\partial\lambda^j)_{\lambda=0}/i^j$. The coefficients $\kappa_j$ are called the cumulants of $X$.
The first four cumulants are given by $\kappa_1 = EX$, $\kappa_2 = VX$, $\kappa_3 = E(X - EX)^3$,
and $\kappa_4 = E(X - EX)^4 - 3(VX)^2$.

The following theorem is essential for proving central limit theorems. 


THEOREM 3.3.3. If $E \exp(i\lambda X_n) \to E \exp(i\lambda X)$ for every $\lambda$ and if
$E \exp(i\lambda X)$ is continuous at $\lambda = 0$, then $X_n \xrightarrow{d} X$.


The proof of Theorem 3.3.3 can be found in Rao (1973, p. 119).
We can now prove the following theorem.


THEOREM 3.3.4 (Lindeberg-Lévy CLT). Let $\{X_t\}$ be i.i.d. with $EX_t = \mu$ and
$VX_t = \sigma^2$. Then $Z_n \to N(0, 1)$.


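Proof. We assume $\mu = 0$ without loss of generality, for we can consider
$\{X_t - \mu\}$ if $\mu \ne 0$. Define $g(\lambda) = \text{Log}\, E \exp(i\lambda X_t)$. Then from (3.3.2) we have

$$g(\lambda) = -\frac{\sigma^2\lambda^2}{2} + o(\lambda^2).$$  (3.3.3)

We have, using (3.3.3),

$$\text{Log}\, E \exp(i\lambda Z_n) = \sum_{t=1}^{n} \text{Log}\, E \exp\left(\frac{i\lambda X_t}{\sigma\sqrt{n}}\right)$$  (3.3.4)
$$= n\, g\left(\frac{\lambda}{\sigma\sqrt{n}}\right)$$
$$= -\frac{\lambda^2}{2} + n \cdot o\left(\frac{\lambda^2}{\sigma^2 n}\right).$$

Therefore the theorem follows from (3.3.1) and Theorem 3.3.3.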
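The Lindeberg-Lévy theorem can be illustrated by simulation. The sketch below is only an illustration under assumed settings: the $X_t$ are taken to be uniform on (0, 1), so that $\mu = 1/2$ and $\sigma^2 = 1/12$, and the empirical distribution of $Z_n$ over repeated samples is compared with $\Phi$ at two points. The function names, $n = 30$, the number of replications, and the seed are hypothetical choices.

```python
import math
import random
from statistics import NormalDist

def standardized_mean(n, rng):
    """Z_n = (V Xbar_n)^(-1/2) (Xbar_n - E Xbar_n) for n i.i.d. uniform(0, 1) draws."""
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    xbar = sum(rng.random() for _ in range(n)) / n
    return math.sqrt(n) * (xbar - mu) / sigma

def empirical_cdf_at(x, n, reps, seed=0):
    """Share of replications with Z_n < x."""
    rng = random.Random(seed)
    return sum(standardized_mean(n, rng) < x for _ in range(reps)) / reps

phi = NormalDist().cdf
gap_at_0 = abs(empirical_cdf_at(0.0, 30, 5000) - phi(0.0))
gap_at_1 = abs(empirical_cdf_at(1.0, 30, 5000) - phi(1.0))
```

Both gaps should be small, of the order of the simulation error, which is consistent with $Z_n \to N(0, 1)$.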


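THEOREM 3.3.5 (Liapounov CLT). Let $\{X_t\}$ be independent with $EX_t = \mu_t$,
$VX_t = \sigma_t^2$, and $E[|X_t - \mu_t|^3] = m_{3t}$. If

$$\lim_{n\to\infty} \left[\sum_{t=1}^{n} \sigma_t^2\right]^{-1/2} \left[\sum_{t=1}^{n} m_{3t}\right]^{1/3} = 0,$$

then $Z_n \to N(0, 1)$.


THEOREM 3.3.6 (Lindeberg-Feller CLT). Let $\{X_t\}$ be independent with
distribution functions $\{F_t\}$ and $EX_t = \mu_t$ and $VX_t = \sigma_t^2$. Define $C_n =
(\sum_{t=1}^{n} \sigma_t^2)^{1/2}$. If

$$\lim_{n\to\infty} \frac{1}{C_n^2} \sum_{t=1}^{n} \int_{|x - \mu_t| > \epsilon C_n} (x - \mu_t)^2\,dF_t(x) = 0$$

for every $\epsilon > 0$, then $Z_n \to N(0, 1)$.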

In the terminology of Definition 3.2.3, central limit theorems provide
conditions under which the limit distribution of $Z_n = (V\bar{X}_n)^{-1/2}(\bar{X}_n - E\bar{X}_n)$ is
$N(0, 1)$. We now introduce the term asymptotic distribution. It simply means
the “approximate distribution when $n$ is large.” Given the mathematical
result $Z_n \xrightarrow{d} N(0, 1)$, we shall make statements such as “the asymptotic
distribution of $Z_n$ is $N(0, 1)$” (written as $Z_n \overset{A}{\sim} N(0, 1)$) or “the asymptotic
distribution of $\bar{X}_n$ is $N(E\bar{X}_n, V\bar{X}_n)$.” These statements should be regarded merely as
more intuitive paraphrases of the result $Z_n \xrightarrow{d} N(0, 1)$. Note that it would be
meaningless to say “the limit distribution of $\bar{X}_n$ is $N(E\bar{X}_n, V\bar{X}_n)$.”

When the asymptotic distribution of $X_n$ is normal, we also say that $X_n$ is
asymptotically normal.

The following theorem shows the accuracy of a normal approximation to 
the true distribution (see Bhattacharya and Rao, 1976, p. 110). 


THEOREM 3.3.7. Let $\{X_t\}$ be i.i.d. with $EX_t = \mu$, $VX_t = \sigma^2$, and $E|X_t|^3 = m_3$.
Let $F_n$ be the distribution function of $Z_n$ and let $\Phi$ be that of $N(0, 1)$. Then

$$|F_n(x) - \Phi(x)| \le (0.7975)\, \frac{m_3}{\sigma^3}\, \frac{1}{\sqrt{n}}$$

for all $x$.
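The bound of Theorem 3.3.7 can be compared with the exact approximation error in a case where $F_n$ is known. The sketch below assumes Bernoulli($p$) observations, for which the sum is binomial, and takes the third absolute moment about the mean, $E|X_t - p|^3$, which coincides with the $m_3$ of the theorem when $\mu = 0$. The function names and the values $n = 50$, $p = 0.3$ are illustrative.

```python
import math
from statistics import NormalDist

def sup_gap_bernoulli(n, p):
    """sup over x of |F_n(x) - Phi(x)| for Z_n built from n Bernoulli(p) draws."""
    phi = NormalDist().cdf
    sd = math.sqrt(n * p * (1.0 - p))
    gap, below = 0.0, 0.0                     # below = P(S_n < k)
    for k in range(n + 1):
        z = (k - n * p) / sd                  # a jump point of F_n
        at_or_below = below + math.comb(n, k) * p**k * (1.0 - p)**(n - k)
        # F_n is a step function, so the supremum is attained at a jump point,
        # approached either from the left or from the right.
        gap = max(gap, abs(below - phi(z)), abs(at_or_below - phi(z)))
        below = at_or_below
    return gap

def berry_esseen_bound(n, p):
    """Right-hand side of Theorem 3.3.7 with m3 taken about the mean."""
    sigma = math.sqrt(p * (1.0 - p))
    m3 = p * (1.0 - p) * ((1.0 - p)**2 + p**2)    # E|X - p|^3 for Bernoulli(p)
    return 0.7975 * m3 / (sigma**3 * math.sqrt(n))

n, p = 50, 0.3
gap = sup_gap_bernoulli(n, p)
bound = berry_esseen_bound(n, p)
```

The exact error `gap` stays below `bound`; the bound is uniform in $x$ and therefore typically conservative.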


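It is possible to approximate the distribution function $F_n$ of $Z_n$ more
accurately by expanding the characteristic function of $Z_n$ in the powers of $n^{-1/2}$. If
$EX_t^4$ exists, we can obtain from (3.3.2)

$$E \exp(i\lambda Z_n) = e^{-\lambda^2/2} \left[1 + \frac{\kappa_3 (i\lambda)^3}{6\sigma^3\sqrt{n}} + \frac{\kappa_4 (i\lambda)^4}{24\sigma^4 n} + \frac{\kappa_3^2 (i\lambda)^6}{72\sigma^6 n} + O(n^{-3/2})\right].$$  (3.3.5)

This is called the Edgeworth expansion (see Cramér, 1946, p. 228). Because
(see Cramér, 1946, p. 106)

$$\int_{-\infty}^{\infty} e^{i\lambda x} \phi^{(r)}(x)\,dx = (-i\lambda)^r e^{-\lambda^2/2},$$  (3.3.6)

where $\phi^{(r)}(x)$ is the $r$th derivative of the density of $N(0, 1)$, we can invert (3.3.5)
to obtain

$$F_n(x) = \Phi(x) - \frac{\kappa_3}{6\sigma^3\sqrt{n}} \Phi^{(3)}(x) + \frac{\kappa_4}{24\sigma^4 n} \Phi^{(4)}(x) + \frac{\kappa_3^2}{72\sigma^6 n} \Phi^{(6)}(x) + O(n^{-3/2}).$$  (3.3.7)

We shall conclude this section by stating a multivariate central limit
theorem, the proof of which can be found in Rao (1973, p. 128).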

We shall conclude this section by stating a multivariate central limit 
theorem, the proof of which can be found in Rao (1973, p. 128). 


THEOREM 3.3.8. Let $\{X_n\}$ be a sequence of $K$-dimensional vectors of
random variables. If $c'X_n$ converges to a normal random variable for every
$K$-dimensional constant vector $c \ne 0$, then $X_n$ converges to a multivariate normal
random variable. (Note that showing convergence of each element of $X_n$
separately is not sufficient.)


3.4 Relationships among lim E, AE, and plim 


Let $F_n$ be the distribution function of $X_n$ and $F_n \to F$ at continuity points of $F$.
We have defined plim $X_n$ in Definition 3.2.1. We define lim E and AE as
follows:

$$\lim EX_n = \lim_{n\to\infty} \int_{-\infty}^{\infty} x\,dF_n(x)$$  (3.4.1)

and

$$\text{AE}\, X_n = \int_{-\infty}^{\infty} x\,dF(x).$$  (3.4.2)

In words, AE, which reads asymptotic expectation or asymptotic mean, is the
mean of the limit distribution.

These three limit operators are similar but different; we can construct 
examples of sequences of random variables such that any two of the three 
concepts either differ from each other or coincide with each other. We shall 
state relationships among the operators in the form of examples and theorems. 
But, first, note the following obvious facts: 

(i) Of the three concepts, only plim $X_n$ can be a nondegenerate random
variable; therefore, if it is, it must differ from $\lim EX_n$ or AE $X_n$.

(ii) If plim $X_n = \alpha$, a constant, then AE $X_n = \alpha$. This follows immediately
from Theorem 3.2.2.


EXAMPLE 3.4.1. Let $X_n$ be defined by
$X_n = Z$ with probability $(n - 1)/n$
$\phantom{X_n} = n$ with probability $1/n$,
where $Z \sim N(0, 1)$. Then plim $X_n = Z$, $\lim EX_n = 1$, and AE $X_n = EZ = 0$.

EXAMPLE 3.4.2. Let $X_n$ be defined by
$X_n = 0$ with probability $(n - 1)/n$
$\phantom{X_n} = n^2$ with probability $1/n$.
Then plim $X_n$ = AE $X_n = 0$, and $\lim EX_n = \lim n = \infty$.


EXAMPLE 3.4.3. Let $X \sim N(\alpha, 1)$ and $Y_n \sim N(\beta, n^{-1})$, where $\beta \ne 0$. Then
$X/Y_n$ is distributed as Cauchy and does not have a mean. Therefore
$\lim E(X/Y_n)$ cannot be defined either. But, because $Y_n \xrightarrow{P} \beta$, AE$(X/Y_n) = \alpha/\beta$
by Theorem 3.2.7 (iii).
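The gap between lim E and AE in Example 3.4.2 can be tabulated exactly. The sketch below is a small illustration using rational arithmetic; the function name and the chosen values of $n$ are hypothetical.

```python
from fractions import Fraction

def example_3_4_2(n):
    """X_n = 0 with probability (n - 1)/n and n^2 with probability 1/n."""
    values = [0, n * n]
    probs = [Fraction(n - 1, n), Fraction(1, n)]
    mean = sum(v * p for v, p in zip(values, probs))          # E X_n
    tail = sum(p for v, p in zip(values, probs) if v != 0)    # P(X_n != 0)
    return mean, tail

# For each n: (E X_n, P(X_n != 0)).
table = {n: example_3_4_2(n) for n in (10, 100, 1000)}
```

The mean equals $n$ and diverges, while the probability that $X_n$ differs from 0 equals $1/n$ and vanishes, so the limit distribution is degenerate at 0 and AE $X_n = 0$.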


The following theorem, proved in Rao (1973, p. 121), gives the conditions
under which lim E = AE.

THEOREM 3.4.1. If $E|X_n|^r < M$ for all $n$, then $\lim EX_n^s = \text{AE}\, X_n^s$ for any
$s < r$. In particular, if $EX_n^2 < M$, then $\lim EX_n = \text{AE}\, X_n$. (Note that this
condition is violated by all three preceding examples.)


We are now in a position to define two important concepts regarding the 
asymptotic properties of estimators, namely, asymptotic unbiasedness and 
consistency. 


DEFINITION 3.4.1. The estimator $\hat\theta_n$ of $\theta$ is said to be asymptotically
unbiased if AE $\hat\theta_n = \theta$. We call AE $\hat\theta_n - \theta$ the asymptotic bias.


Note that some authors define asymptotic unbiasedness using lim E instead
of AE. Then it refers to a different concept.


DEFINITION 3.4.2. The estimator $\hat\theta_n$ of $\theta$ is said to be a consistent estimator if
plim $\hat\theta_n = \theta$.


Some authors use the term weakly consistent in the preceding definition, to
distinguish it from the term strong consistency used to describe the property
$\hat\theta_n \xrightarrow{a.s.} \theta$.¹²

In view of the preceding discussions, it is clear that a consistent estimator is
asymptotically unbiased, but not vice versa.


3.5 Consistency and Asymptotic Normality of Least Squares Estimator 


The main purpose of this section is to prove the consistency and the
asymptotic normality of the least squares estimators of $\beta$ and $\sigma^2$ in Model 1 (classical
linear regression model) of Chapter 1. The large sample results of the
preceding sections will be extensively utilized. At the end of the section we shall give
additional examples of the application of the large sample theorems.


THEOREM 3.5.1. In Model 1 the least squares estimator is a consistent
estimator of $\beta$ if $\lambda_s(X'X) \to \infty$, where $\lambda_s(X'X)$ denotes the smallest characteristic
root of $X'X$.¹³


Proof. The following four statements are equivalent:
(i) $\lambda_s(X'X) \to \infty$.
(ii) $\lambda_l[(X'X)^{-1}] \to 0$, where $\lambda_l$ refers to the largest characteristic root.
(iii) $\text{tr}\,(X'X)^{-1} \to 0$.
(iv) Every diagonal element of $(X'X)^{-1}$ converges to 0.
Because $\hat\beta$ is unbiased with variance-covariance matrix $\sigma^2(X'X)^{-1}$, statement (iv)
implies the consistency of $\hat\beta$ by Theorem 3.2.1.

The reader should verify that it is not sufficient for consistency to assume
that all the diagonal elements of $X'X$ go to infinity.


THEOREM 3.5.2. If we assume that $\{u_t\}$ are i.i.d. in Model 1, the least squares
estimator $\hat\sigma^2$ defined in (1.2.5) is a consistent estimator of $\sigma^2$.


Proof. We can write

$$\hat\sigma^2 = T^{-1}u'u - T^{-1}u'Pu,$$  (3.5.1)

where $P = X(X'X)^{-1}X'$. By Theorem 3.3.2 (Kolmogorov LLN 2),

$$T^{-1}u'u \xrightarrow{P} \sigma^2.$$  (3.5.2)

Because $u'Pu$ is nonnegative, we can use the generalized Chebyshev inequality
(3.2.5) to obtain

$$P(T^{-1}u'Pu > \epsilon^2) \le \epsilon^{-2} E\,T^{-1}u'Pu = \sigma^2 \epsilon^{-2} T^{-1} K.$$  (3.5.3)

Therefore

$$T^{-1}u'Pu \xrightarrow{P} 0.$$  (3.5.4)

Therefore the theorem follows from (3.5.1), (3.5.2), and (3.5.4) by using
Theorem 3.2.6.


We shall prove the asymptotic normality of the least squares estimator $\hat\beta$ in
two steps: first, for the case of one regressor, and, second, for the case of many
regressors.


THEOREM 3.5.3. Assume that $\{u_t\}$ are i.i.d. in Model 1. Assume that $K = 1$,
and because $X$ is a vector in this case, write it as $x$. If

$$\lim_{T\to\infty} (x'x)^{-1} \max_{1 \le t \le T} x_t^2 = 0,$$  (3.5.5)

then $\sigma^{-1}(x'x)^{1/2}(\hat\beta - \beta) \to N(0, 1)$.

Proof. We prove the theorem using the Lindeberg-Feller CLT (Theorem
3.3.6). Take $x_t u_t$ as the $X_t$ of Theorem 3.3.6. Then $\mu_t = 0$ and $\sigma_t^2 = \sigma^2 x_t^2$. Let
$F_X$ and $F_Z$ be the distribution functions of $X$ and $Z$, respectively, where
$X^2 = Z$. Then

$$\int_{x^2 > c^2} x^2\,dF_X(x) = \int_{z > c^2} z\,dF_Z(z)$$

for any $c$. (This can be verified from the definition of the Riemann-Stieltjes
integral given in Section 3.1.2.) Therefore we need to prove

$$\lim_{T\to\infty} \frac{1}{\sigma^2 x'x} \sum_{t=1}^{T} \int_{a > \epsilon^2\sigma^2 x'x} a\,dF_t(a) = 0$$  (3.5.6)

for any $\epsilon$, where $F_t$ is the distribution function of $(x_t u_t)^2$. Let $G$ be the
distribution function of $u_t^2$. Then, because $F_t(a) = P(x_t^2 u_t^2 < a) = P(u_t^2 < a/x_t^2) =
G(a/x_t^2)$, we have

$$\int_{a > \epsilon^2\sigma^2 x'x} a\,dF_t(a) = \int_{a > \epsilon^2\sigma^2 x'x} a\,dG(a/x_t^2)$$
$$= \int_{\lambda > \epsilon^2\sigma^2 x_t^{-2} x'x} x_t^2 \lambda\,dG(\lambda).$$

Therefore (3.5.6) follows from the inequality

$$\int_{\lambda > \epsilon^2\sigma^2 x_t^{-2} x'x} \lambda\,dG(\lambda) \le \int_{\lambda > \epsilon^2\sigma^2 (\max_t x_t^2)^{-1} x'x} \lambda\,dG(\lambda)$$  (3.5.7)

and the assumption (3.5.5).


THEOREM 3.5.4. Assume that $\{u_t\}$ are i.i.d. in Model 1. Assume also

$$\lim_{T\to\infty} (x_i'x_i)^{-1} \max_{1 \le t \le T} x_{ti}^2 = 0$$  (3.5.8)

for every $i = 1, 2, \dots, K$. Define $Z = XS^{-1}$, where $S$ is the $K \times K$ diagonal
matrix the $i$th diagonal element of which is $(x_i'x_i)^{1/2}$, and assume that
$\lim_{T\to\infty} Z'Z \equiv R$ exists and is nonsingular. Then

$$S(\hat\beta - \beta) \to N(0, \sigma^2 R^{-1}).$$

Proof. We have $S(\hat\beta - \beta) = (Z'Z)^{-1}Z'u$. The limit distribution of
$c'(Z'Z)^{-1}Z'u$ for any constant vector $c$ is the same as that of $\gamma'Z'u$ where
$\gamma' = c'R^{-1}$. But, because of Theorem 3.5.3, the asymptotic normality of $\gamma'Z'u$
holds if

$$\lim_{T\to\infty} (\gamma'Z'Z\gamma)^{-1} \max_{1 \le t \le T} \left(\sum_{k=1}^{K} \gamma_k z_{tk}\right)^2 = 0,$$  (3.5.9)

where $\gamma_k$ is the $k$th element of $\gamma$ and $z_{tk}$ is the $t,k$th element of $Z$. But, because
$\gamma'Z'Z\gamma \ge \lambda_s(Z'Z)\gamma'\gamma$ by Theorem 10 of Appendix 1, we have

$$(\gamma'Z'Z\gamma)^{-1} \max_t \left(\sum_{k=1}^{K} \gamma_k z_{tk}\right)^2 \le (\gamma'Z'Z\gamma)^{-1}\gamma'\gamma \max_t \sum_{k=1}^{K} z_{tk}^2$$  (3.5.10)
$$\le \frac{1}{\lambda_s(Z'Z)} \max_t \sum_{k=1}^{K} z_{tk}^2$$
$$= \frac{1}{\lambda_s(Z'Z)} \max_t \sum_{k=1}^{K} \frac{x_{tk}^2}{x_k'x_k},$$

where the first inequality above follows from the Cauchy-Schwarz inequality.
Therefore (3.5.9) follows from (3.5.10) because the last term of (3.5.10)
converges to 0 by our assumptions. Therefore, by Theorem 3.5.3,

$$\frac{\gamma'Z'u}{\sigma(\gamma'Z'Z\gamma)^{1/2}} \to N(0, 1)$$

for any constant vector $\gamma \ne 0$. Since $\gamma'Z'Z\gamma \to c'R^{-1}c$, we have

$$\gamma'Z'u \to N(0, \sigma^2 c'R^{-1}c).$$

Thus the theorem follows from Theorem 3.3.8.


At this point it seems appropriate to consider the significance of assumption 
(3.5.5). Note that (3.5.5) implies $x'x \to \infty$. It would be instructive to try to
construct a sequence for which $x'x \to \infty$ and yet (3.5.5) does not hold. The
following theorem shows, among other things, that (3.5.5) is less restrictive
than the commonly used assumption that $\lim T^{-1}x'x$ exists and is a nonzero
constant. It follows that if $\lim T^{-1}X'X$ exists and is nonsingular, the condition
of Theorem 3.5.4 is satisfied.


THEOREM 3.5.5. Given a sequence of constants $\{x_t\}$, consider the
statements:
(i) $\lim_{T\to\infty} T^{-1}c_T = a$, where $a \ne 0$, $a < \infty$, and $c_T = \sum_{t=1}^{T} x_t^2$.
(ii) $\lim_{T\to\infty} c_T = \infty$.
(iii) $\lim_{T\to\infty} c_T^{-1} x_T^2 = 0$.
(iv) $\lim_{T\to\infty} c_T^{-1} \max_{1 \le t \le T} x_t^2 = 0$.
Then, (i) $\Rightarrow$ [(ii) and (iii)] $\Rightarrow$ (iv).


Proof. (i) $\Rightarrow$ (ii) is obvious. We have

$$\frac{T x_T^2}{T(T-1)} = \frac{c_T}{T} - \frac{c_{T-1}}{T-1} + \frac{c_T}{T(T-1)}.$$  (3.5.11)

Therefore $\lim_{T\to\infty} (T - 1)^{-1} x_T^2 = 0$, which implies (i) $\Rightarrow$ (iii). We shall now
prove [(ii) and (iii)] $\Rightarrow$ (iv). Given $\epsilon > 0$, there exists $T_1$ such that for any
$T \ge T_1$

$$c_T^{-1} x_T^2 < \epsilon$$  (3.5.12)

because of (iii). Given this $T_1$, there exists $T_2 > T_1$ such that for any $T > T_2$

$$c_T^{-1} \max_{1 \le t < T_1} x_t^2 < \epsilon$$  (3.5.13)

because of (ii). But (3.5.12) implies that for any $T > T_2$

$$c_T^{-1} x_t^2 < \epsilon \quad \text{for } t = T_1, T_1 + 1, \dots, T.$$  (3.5.14)

Finally, (3.5.13) and (3.5.14) imply (iv).
It should be easy to construct a sequence for which (3.5.5) is satisfied but (i)
is not.
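The conditions of Theorem 3.5.5 can be explored numerically for particular sequences. The sketch below considers two illustrative choices that are not taken from the text: a linear trend $x_t = t$ and a geometric sequence $x_t = 2^t$. For each it evaluates $c_T^{-1} \max_t x_t^2$, the quantity in (3.5.5), and for the trend also $T^{-1}c_T$, the quantity in statement (i). The function names are hypothetical.

```python
def ratio_max(xs):
    """c_T^{-1} max_t x_t^2 for the finite sequence xs = (x_1, ..., x_T)."""
    squares = [x * x for x in xs]
    return max(squares) / sum(squares)

def average_square(xs):
    """T^{-1} c_T for the finite sequence xs."""
    return sum(x * x for x in xs) / len(xs)

def trend(T):
    return [float(t) for t in range(1, T + 1)]        # x_t = t

def geometric(T):
    return [2.0 ** t for t in range(1, T + 1)]        # x_t = 2^t

trend_ratios = [ratio_max(trend(T)) for T in (10, 100, 1000)]
trend_averages = [average_square(trend(T)) for T in (10, 100, 1000)]
geometric_ratios = [ratio_max(geometric(T)) for T in (10, 20, 40)]
```

For the trend the ratio falls toward 0 while $T^{-1}c_T$ grows without bound, so (3.5.5) appears to hold although (i) fails. For the geometric sequence $c_T \to \infty$ but the ratio stays near 3/4, so (3.5.5) fails.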


Next we shall prove the asymptotic normality of the least squares estimator
of the variance $\sigma^2$.


THEOREM 3.5.6. Assume that $\{u_t\}$ are i.i.d. with a finite fourth moment
$Eu_t^4 \equiv m_4$ in Model 1. Then $\sqrt{T}(\hat\sigma^2 - \sigma^2) \to N(0, m_4 - \sigma^4)$.


Proof. We can write

$$\sqrt{T}(\hat\sigma^2 - \sigma^2) = \frac{u'u - T\sigma^2}{\sqrt{T}} - \frac{1}{\sqrt{T}}\,u'Pu.$$  (3.5.15)

The second term of the right-hand side of (3.5.15) converges to 0 in probability
by the same reasoning as in the proof of Theorem 3.5.2, and the first term can
be dealt with by application of the Lindeberg-Lévy CLT (Theorem 3.3.4).
Therefore the theorem follows by Theorem 3.2.7(i).


Let us look at a few more examples of the application of convergence 
theorems. 
EXAMPLE 3.5.1. Consider Model 1 where $K = 1$:

$$y = \beta x + u,$$  (3.5.16)

where we assume $\lim T^{-1}x'x = c \ne 0$ and $\{u_t\}$ are i.i.d. Obtain the probability
limit of $\hat\beta_R = y'y/x'y$. (Note that this estimator is obtained by minimizing the
sum of squares in the direction of the $x$-axis.)

We can write

$$\hat\beta_R = \frac{\beta^2 + 2\beta\,\dfrac{x'u}{x'x} + \dfrac{u'u}{x'x}}{\beta + \dfrac{x'u}{x'x}}.$$  (3.5.17)

We have $E(x'u/x'x)^2 = \sigma^2(x'x)^{-1} \to 0$ as $T \to \infty$. Therefore, by Theorem
3.2.1,

$$\text{plim}\, \frac{x'u}{x'x} = 0.$$  (3.5.18)

Also we have

$$\text{plim}\, \frac{u'u}{x'x} = \text{plim} \left[\frac{u'u}{T} \Big/ \frac{x'x}{T}\right] = \frac{\sigma^2}{c}$$  (3.5.19)

because of Theorem 3.2.6 and Theorem 3.3.2 (Kolmogorov LLN 2).
Therefore, from (3.5.17), (3.5.18), and (3.5.19) and by using Theorem 3.2.6 again,
we obtain

$$\text{plim}\, \hat\beta_R = \beta + \frac{\sigma^2}{\beta c}.$$  (3.5.20)

Note that $c$ may be allowed to be $\infty$, in which case $\hat\beta_R$ becomes a consistent
estimator of $\beta$.
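The probability limit (3.5.20) can be checked by simulation. The sketch below is only an illustration under assumed settings: $x_t = 1$ for all $t$ (so that $c = 1$), normally distributed $u_t$, $\beta = 2$, and $\sigma = 1$. The function name, sample size, and seed are hypothetical choices.

```python
import random

def reverse_ls(beta, sigma, T, seed=0):
    """beta_R = y'y / x'y for y_t = beta * x_t + u_t with x_t = 1 and normal u_t."""
    rng = random.Random(seed)
    yy = xy = 0.0
    for _ in range(T):
        x = 1.0                              # gives T^{-1} x'x = c = 1
        y = beta * x + rng.gauss(0.0, sigma)
        yy += y * y
        xy += x * y
    return yy / xy

beta, sigma, c = 2.0, 1.0, 1.0
estimate = reverse_ls(beta, sigma, T=200000)
limit = beta + sigma**2 / (beta * c)         # right-hand side of (3.5.20)
```

With these values the limit is 2.5 rather than $\beta = 2$, and the simulated estimate should settle near the limit for large $T$, which shows the inconsistency of $\hat\beta_R$ when $c$ is finite.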


EXAMPLE 3.5.2. Consider the same model as in Example 3.5.1 except
that we now assume $\lim T^{-2}x'x = \infty$. Also assume $\lim_{T\to\infty} (x'x)^{-1}
\max_{1 \le t \le T} x_t^2 = 0$ so that $\hat\beta$ is asymptotically normal. (Give an example of a
sequence satisfying these two conditions.) Show that $\hat\beta = x'y/x'x$ and $\hat\beta_R =
y'y/x'y$ have the same asymptotic distribution.

Clearly, plim $\hat\beta$ = plim $\hat\beta_R = \beta$. Therefore, by Theorem 3.2.2, both
estimators have the same degenerate limit distribution. But the question concerns
the asymptotic distribution; therefore we must obtain the limit distribution of
each estimator after a suitable normalization. We can write

$$(x'x)^{1/2}(\hat\beta_R - \beta) = \frac{\dfrac{\beta x'u}{(x'x)^{1/2}} + \dfrac{u'u}{(x'x)^{1/2}}}{\beta + \dfrac{x'u}{x'x}}.$$  (3.5.21)

But by our assumptions plim $u'u(x'x)^{-1/2} = 0$ and plim $(x'u/x'x) = 0$.
Therefore $(x'x)^{1/2}(\hat\beta_R - \beta)$ and $(x'x)^{1/2}(\hat\beta - \beta)$ have the same limit distribution by
repeated applications of Theorem 3.2.7.


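EXAMPLE 3.5.3. Consider Model 1 with $K = 2$:

$$y = \beta_1 x_1 + \beta_2 x_2 + u,$$  (3.5.22)

where we assume that $\{u_t\}$ are i.i.d. Assume also $\lim T^{-1}X'X = A$, where $A$ is a
$2 \times 2$ nonsingular matrix. Obtain the asymptotic distribution of $\hat\beta_1/\hat\beta_2$, where
$\hat\beta_1$ and $\hat\beta_2$ are the least squares estimators of $\beta_1$ and $\beta_2$, assuming $\beta_2 \ne 0$.

We can write

$$\sqrt{T}\left(\frac{\hat\beta_1}{\hat\beta_2} - \frac{\beta_1}{\beta_2}\right) = \sqrt{T}\, \frac{\beta_2(\hat\beta_1 - \beta_1) - \beta_1(\hat\beta_2 - \beta_2)}{\hat\beta_2 \beta_2}.$$  (3.5.23)

Because plim $\hat\beta_2 = \beta_2$, the right-hand side of (3.5.23) has the same limit
distribution as

$$\beta_2^{-1}\sqrt{T}(\hat\beta_1 - \beta_1) - \beta_1\beta_2^{-2}\sqrt{T}(\hat\beta_2 - \beta_2).$$

But, because our assumptions imply (3.5.8) by Theorem 3.5.5, $[\sqrt{T}(\hat\beta_1 - \beta_1),
\sqrt{T}(\hat\beta_2 - \beta_2)]$ converges to a bivariate normal variable by Theorem 3.5.4.
Therefore, by Theorem 3.2.5, we have

$$\sqrt{T}\left(\frac{\hat\beta_1}{\hat\beta_2} - \frac{\beta_1}{\beta_2}\right) \to N(0, \sigma^2\gamma'A^{-1}\gamma),$$

where $\gamma' = (\beta_2^{-1}, -\beta_1\beta_2^{-2})$.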


Exercises 


1. (Section 3.1.2) 
Prove that the distribution function is continuous from the left. 


2. (Section 3.2)
Prove Theorem 3.2.3. HINT: Definition 3.1.2 (iii) implies that if $\Omega_n \subset \Omega_m$
for $n < m$ and $\lim_{n\to\infty} \Omega_n = A$, then $\lim_{n\to\infty} P(\Omega_n) = P(A)$.


3. (Section 3.2) 
Prove Theorem 3.2.4. 


4. (Section 3.3)
Let $\{X_t\}$ be as defined in Theorem 3.3.1. Prove
$\lim_{n\to\infty} E(\bar{X}_n - E\bar{X}_n)^2 = 0$.


5. (Section 3.3)
Let $\{a_t\}$, $t = 1, 2, \dots$, be a nonnegative sequence such that
$(\sum_{t=1}^{T} a_t)/T < M$ for some $M$ and every $T$. Prove $\lim_{T\to\infty} \sum_{t=1}^{T} (a_t/t^2) < \infty$.


6. (Section 3.3)
Prove that the conditions of Theorem 3.3.6 (Lindeberg-Feller CLT)
follow from the conditions of Theorem 3.3.4 (Lindeberg-Lévy CLT) or of
Theorem 3.3.5 (Liapounov CLT).


7. (Section 3.3)
Let $\{X_t\}$ be i.i.d. with $EX_t = \mu$. Then $\bar{X}_n \xrightarrow{P} \mu$. This is a corollary of
Theorem 3.3.2 (Kolmogorov LLN 2) and is called Khinchine's WLLN
(weak law of large numbers). Prove this theorem using characteristic
functions.


8. (Section 3.5)
Show that $\lambda_s(X'X) \to \infty$ implies $x_i'x_i \to \infty$ for every $i$, where $x_i$ is the $i$th
column vector of $X$. Show also that the converse does not hold.


9. (Section 3.5)
Assume $K = 1$ in Model 1 and write $X$ as $x$. Assume that $\{u_t\}$ are
independent. If there exist $L$ and $M$ such that $0 < L < x'x/T < M$ for all $T$, show
$\hat\beta \xrightarrow{a.s.} \beta$.


10. (Section 3.5)
Suppose $y = y^* + u$ and $x = x^* + v$, where each variable is a vector of $T$
components. Assume $y^*$ and $x^*$ are nonstochastic and $(u_t, v_t)$ is a
bivariate i.i.d. random variable with mean 0 and constant variances $\sigma_u^2$,
$\sigma_v^2$, respectively, and covariance $\sigma_{uv}$. Assume $y^* = \beta x^*$, but $y^*$ and $x^*$ are
not observable so that we must estimate $\beta$ on the basis of $y$ and $x$.
Obtain the probability limit of $\hat\beta = x'y/x'x$ on the assumption that
$\lim_{T\to\infty} T^{-1}x^{*\prime}x^* = M$.


11. (Section 3.5)
Consider the regression equation $y = X_1\beta_1 + X_2\beta_2 + u$. Assume all the
assumptions of Model 1 except that $X = (X_1, X_2)$ may not be full rank.
Let $Z$ be the matrix consisting of a maximal linearly independent subset
of the columns of $X_2$ and assume that the smallest characteristic root of
the matrix $(X_1, Z)'(X_1, Z)$ goes to infinity as $T$ goes to infinity. Derive a
consistent estimator of $\beta_1$. Prove that it is consistent.


12. (Section 3.5)
Change the assumptions of Theorem 3.5.3 as follows: $\{u_t\}$ are
independent with $Eu_t = 0$, $Vu_t = \sigma^2$, and $E|u_t|^3 = m_3$. Prove $\sigma^{-1}(x'x)^{1/2}
(\hat\beta - \beta) \to N(0, 1)$ using Theorem 3.3.5 (Liapounov CLT).


13. (Section 3.5)
Construct a sequence $\{x_t\}$ such that $\sum_{t=1}^{T} x_t^2 \to \infty$ but the condition (3.5.5)
is not satisfied.


14. (Section 3.5)
Construct a sequence $\{x_t\}$ such that the condition (3.5.5) holds but the
condition (i) of Theorem 3.5.5 does not.


15. (Section 3.5)
Let $l$ be the vector of ones. Assuming $\lim T^{-1}l'x^* = N \ne 0$ in the model of
Exercise 10, prove the consistency of $\tilde\beta = l'y/l'x$ and obtain its asymptotic
distribution.


16. (Section 3.5)
Assume that $\{u_t\}$ are i.i.d. in Model 1. Assume $K = 1$ and write
$X$ as $x$. Obtain the asymptotic distribution of $\tilde\beta = l'y/l'x$ assuming
$\lim_{T\to\infty} T^{-1}(l'x)^2 = \infty$, where $l$ is the vector of ones.


17. (Section 3.5)
Consider the classical regression model $y = \alpha x + \beta z + u$, where $\alpha$ and $\beta$
are scalar unknown parameters, $x$ and $z$ are $T$-component vectors of
known constants, and $u$ is a $T$-component vector of unobservable i.i.d.
random variables with zero mean and unit variance. Suppose we are given
an estimator $\tilde\beta$ that is independent of $u$ and the limit distribution of
$T^{1/2}(\tilde\beta - \beta)$ is $N(0, 1)$. Define the estimator $\hat\alpha$ by

$$\hat\alpha = \frac{x'(y - \tilde\beta z)}{x'x}.$$

Assuming $\lim T^{-1}x'x = c$ and $\lim T^{-1}x'z = d$, obtain the asymptotic
distribution of $\hat\alpha$. Assume $c \ne 0$ and $d \ne 0$.


18. (Section 3.5)

Consider the regression model $y = \beta(x + \alpha l) + u$, where $y$, $x$, $l$, and $u$ are
$T$-vectors and $\alpha$ and $\beta$ are scalar unknown parameters. Assume that $l$ is a
$T$-vector of ones, $\lim_{T\to\infty} x'l = 0$, and $\lim_{T\to\infty} T^{-1}x'x = c$, where $c$ is a
nonzero constant. Also assume that the elements of $u$ are i.i.d. with zero
mean and constant variance $\sigma^2$. Supposing we have an estimate of $\alpha$
denoted by $\hat\alpha$ such that it is distributed independently of $u$ and
$\sqrt{T}(\hat\alpha - \alpha) \to N(0, \lambda^2)$, obtain the asymptotic distribution of $\hat\beta$ defined
by

$$\hat\beta = \frac{(x + \hat\alpha l)'y}{(x + \hat\alpha l)'(x + \hat\alpha l)}.$$

4 Asymptotic Properties 
of Extremum Estimators 


By extremum estimators we mean estimators obtained by either maximizing 
or minimizing a certain function defined over the parameter space. First, we 
shall establish conditions for the consistency and the asymptotic normality of 
extremum estimators (Section 4.1), and second, we shall apply the results to 
important special cases, namely, the maximum likelihood estimator (Section 
4.2) and the nonlinear least squares estimator (Section 4.3). 

What we call extremum estimators Huber called M estimators, meaning 
maximum-likelihood-like estimators. He developed the asymptotic proper- 
ties in a series of articles (summarized in Huber, 1981). The emphasis here, 
however, will be different from his. The treatment in this chapter is more 
general in the sense that we require neither independent nor identically dis- 
tributed random variables. Also, the intention here is not to strive for the least 
stringent set of assumptions but to help the reader understand the fundamen- 
tal facts by providing an easily comprehensible set of sufficient conditions. 

In Sections 4.4 and 4.5 we shall discuss iterative methods for maximization 
or minimization, the asymptotic properties of the likelihood ratio and asymp- 
totically equivalent tests, and related topics. In Section 4.6 we shall discuss the 
least absolute deviations estimator, for which the general results of Section 4.1
are only partially applicable. 


4.1 General Results 
4.1.1 Consistency 


Because there is no essential difference between maximization and minimiza- 
tion, we shall consider an estimator that maximizes a certain function of the 
parameters. Let us denote the function by $Q_T(y, \theta)$, where
$y = (y_1, y_2, \ldots, y_T)'$ is a $T$-vector of random variables and $\theta$ is a $K$-vector of
parameters. [We shall sometimes write it more compactly as $Q_T(\theta)$.] The
vector $\theta$ should be understood to be the set of parameters that characterize the
distribution of $y$. Let us denote the domain of $\theta$, or the parameter space, by $\Theta$
and the "true value" of $\theta$ by $\theta_0$. The parameter space is the set of all the
possible values that the true value $\theta_0$ can take. When we take various operations
on a function of $y$, such as expectation or probability limit, we shall use
the value $\theta_0$.

As a preliminary to the consistency theorems, we shall define three modes of
uniform convergence of a sequence of random variables.


DEFINITION 4.1.1. Let $g_T(\theta)$ be a nonnegative sequence of random variables
depending on a parameter vector $\theta$. Consider the three modes of uniform
convergence of $g_T(\theta)$ to 0:

(i) $P[\lim_{T\to\infty} \sup_{\theta\in\Theta} g_T(\theta) = 0] = 1$,

(ii) $\lim_{T\to\infty} P[\sup_{\theta\in\Theta} g_T(\theta) < \epsilon] = 1$ for any $\epsilon > 0$,

(iii) $\lim_{T\to\infty} \inf_{\theta\in\Theta} P[g_T(\theta) < \epsilon] = 1$ for any $\epsilon > 0$.

If (i) holds, we say $g_T(\theta)$ converges to 0 almost surely uniformly in $\theta \in \Theta$. If (ii)
holds, we say $g_T(\theta)$ converges to 0 in probability uniformly in $\theta \in \Theta$. If (iii)
holds, we say $g_T(\theta)$ converges to 0 in probability semiuniformly in $\theta \in \Theta$.


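It is easy to show that (i) implies (ii) and (ii) implies (iii). Consider an
example of a sequence for which (iii) holds but (ii) does not. Let the parameter
space $\Theta$ be [0, 1] and the sample space $\Omega$ also be [0, 1] with the probability
measure equal to Lebesgue measure. For $\theta \in \Theta$ and $\omega \in \Omega$, define $g_T(\omega, \theta)$ by

$$g_T(\omega, \theta) = \begin{cases} 1 & \text{if } \theta = \dfrac{i}{T} \text{ and } \dfrac{i}{T} \le \omega \le \dfrac{i+1}{T}, \quad i = 0, 1, \ldots, T-1, \\ 0 & \text{otherwise.} \end{cases}$$

Then, for $0 < \epsilon < 1$,

$$\inf_{\theta\in\Theta} P[g_T(\omega, \theta) < \epsilon] = (T-1)/T \quad \text{and} \quad P[\sup_{\theta\in\Theta} g_T(\omega, \theta) < \epsilon] = 0 \text{ for all } T.$$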
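The two probabilities in this example can be checked with exact rational arithmetic. The following is only a minimal sketch: the function names are hypothetical, and `prob_below_eps` encodes the observation that, for $\theta = i/T$, the set on which $g_T = 1$ is an interval of length $1/T$ rather than integrating numerically.

```python
from fractions import Fraction


def g(T, omega, theta):
    """g_T(omega, theta): 1 if theta = i/T and i/T <= omega <= (i+1)/T for some i < T, else 0."""
    for i in range(T):
        if theta == Fraction(i, T) and Fraction(i, T) <= omega <= Fraction(i + 1, T):
            return 1
    return 0


def prob_below_eps(T, theta):
    """P[g_T(omega, theta) < eps] for 0 < eps < 1 under Lebesgue measure on [0, 1].

    For theta = i/T the set where g_T = 1 is an interval of length 1/T;
    for any other theta, g_T is identically 0.
    """
    on_grid = any(theta == Fraction(i, T) for i in range(T))
    return 1 - Fraction(1, T) if on_grid else Fraction(1)


def sup_over_theta(T, omega):
    """sup over theta in [0, 1] of g_T(omega, theta); only theta = i/T can give 1."""
    return max(g(T, omega, Fraction(i, T)) for i in range(T))


for T in (2, 10, 100):
    # The infimum over theta is attained on the grid i/T, where it equals (T-1)/T.
    inf_prob = min(prob_below_eps(T, Fraction(i, T)) for i in range(T))
    # The supremum over theta equals 1 at every omega tried, so P[sup < eps] = 0.
    sups = {sup_over_theta(T, Fraction(k, 37)) for k in range(38)}
    print(T, inf_prob, sups)
```

The infimum tends to 1 as $T$ grows, so mode (iii) holds, while the supremum over $\theta$ equals 1 at every sampled $\omega$, so mode (ii) fails.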


Now we shall prove the consistency of extremum estimators. Because we 
need to distinguish between the global maximum and a local maximum, we 
shall present two theorems to handle the two cases. 


THEOREM 4.1.1. Make the assumptions:

(A) The parameter space $\Theta$ is a compact subset of the Euclidean $K$-space
($R^K$). (Note that $\theta_0$ is in $\Theta$.)

(B) $Q_T(y, \theta)$ is continuous in $\theta \in \Theta$ for all $y$ and is a measurable function of
$y$ for all $\theta \in \Theta$.

(C) $T^{-1}Q_T(\theta)$ converges to a nonstochastic function $Q(\theta)$ in probability
uniformly in $\theta \in \Theta$ as $T$ goes to $\infty$, and $Q(\theta)$ attains a unique global maximum
at $\theta_0$. (The continuity of $Q(\theta)$ follows from our assumptions.)

Define $\hat\theta_T$ as a value that satisfies

$$Q_T(\hat\theta_T) = \max_{\theta\in\Theta} Q_T(\theta). \tag{4.1.1}$$

[It is understood that if $\hat\theta_T$ is not unique, we appropriately choose one such
value in such a way that $\hat\theta_T(y)$ is a measurable function of $y$. This is possible by
a theorem of Jennrich (1969, p. 637).] Then $\hat\theta_T$ converges to $\theta_0$ in probability.¹


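Proof. Let $N$ be an open neighborhood in $R^K$ containing $\theta_0$. Then $\bar N \cap \Theta$,
where $\bar N$ is the complement of $N$ in $R^K$, is compact. Therefore
$\max_{\theta\in\bar N\cap\Theta} Q(\theta)$ exists. Denote

$$\epsilon = Q(\theta_0) - \max_{\theta\in\bar N\cap\Theta} Q(\theta). \tag{4.1.2}$$

Let $A_T$ be the event "$|T^{-1}Q_T(\theta) - Q(\theta)| < \epsilon/2$ for all $\theta$." Then

$$A_T \Rightarrow Q(\hat\theta_T) > T^{-1}Q_T(\hat\theta_T) - \epsilon/2 \tag{4.1.3}$$

and

$$A_T \Rightarrow T^{-1}Q_T(\theta_0) > Q(\theta_0) - \epsilon/2. \tag{4.1.4}$$

But, because $Q_T(\hat\theta_T) \ge Q_T(\theta_0)$ by the definition of $\hat\theta_T$, we have from Exp.
(4.1.3)

$$A_T \Rightarrow Q(\hat\theta_T) > T^{-1}Q_T(\theta_0) - \epsilon/2. \tag{4.1.5}$$

Therefore, adding both sides of the inequalities in (4.1.4) and (4.1.5), we
obtain

$$A_T \Rightarrow Q(\hat\theta_T) > Q(\theta_0) - \epsilon. \tag{4.1.6}$$

Therefore, from (4.1.2) and (4.1.6) we can conclude $A_T \Rightarrow \hat\theta_T \in N$, which
implies $P(A_T) \le P(\hat\theta_T \in N)$. But, since $\lim_{T\to\infty} P(A_T) = 1$ by assumption C,
$\hat\theta_T$ converges to $\theta_0$ in probability.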

Note that $\hat\theta_T$ is defined as the value of $\theta$ that maximizes $Q_T(\theta)$ within the
parameter space $\Theta$. This is a weakness of the theorem in so far as the extremum
estimators commonly used in practice are obtained by unconstrained maximization
or minimization. This practice prevails because of its relative computational
ease, even though constrained maximization or minimization
would be more desirable if a researcher believed that the true value lay in a
proper subset of $R^K$. The consistency of the unconstrained maximum $\tilde\theta_T$
defined by

$$Q_T(\tilde\theta_T) = \sup_{\theta\in R^K} Q_T(\theta) \tag{4.1.7}$$

will follow from the additional assumption

(D) $\lim_{T\to\infty} P[Q_T(\theta_0) > \sup_{\theta\notin\Theta} Q_T(\theta)] = 1$

because of the inequality

$$P[Q_T(\theta_0) > \sup_{\theta\notin\Theta} Q_T(\theta)] \le P(\tilde\theta_T \in \Theta). \tag{4.1.8}$$


As we shall see in later applications, $Q_T$ is frequently the sum of independent
random variables or, at least, of random variables with a limited degree of
dependence. Therefore we can usually expect $T^{-1}Q_T(\theta)$ to converge to a
constant in probability by using some form of a law of large numbers. However,
the theorem can be made more general by replacing $T^{-1}Q_T(\theta)$ with
$h(T)^{-1}Q_T(\theta)$, where $h(T)$ is an increasing function of $T$.

The three major assumptions of Theorem 4.1.1 are (1) the compactness of
$\Theta$, (2) the continuity of $Q_T(\theta)$, and (3) the uniform convergence of $T^{-1}Q_T(\theta)$
to $Q(\theta)$. To illustrate the importance of these assumptions, we shall give
examples that show what goes wrong when one or more of the three
assumptions are removed. In all the examples, $Q_T$ is assumed to be nonstochastic
for simplicity.


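EXAMPLE 4.1.1. $\Theta = [-1, 1]$, $\theta_0 = -1/2$, $T^{-1}Q_T$ not continuous and not
uniformly convergent.

$$T^{-1}Q_T(\theta) = \begin{cases} 1 + \theta, & -1 \le \theta \le \theta_0 \\ -\theta, & \theta_0 < \theta \le 0 \\ \dfrac{\theta^T}{1 - \theta^T}, & 0 < \theta < 1 \\ 0, & \theta = 1. \end{cases}$$

Here the extremum estimator does not exist, although $\lim T^{-1}Q_T$ attains its
unique maximum at $\theta_0$.


EXAMPLE 4.1.2. $\Theta = [-1, \infty)$, $\theta_0 = -1/2$, $T^{-1}Q_T$ continuous but not
uniformly convergent.

$$Q(\theta) = \begin{cases} 1 + \theta, & -1 \le \theta \le \theta_0 \\ -\theta, & \theta_0 < \theta \le 0 \\ 0, & \text{elsewhere.} \end{cases}$$

$$T^{-1}Q_T(\theta) = Q(\theta) + h_T(\theta),$$

where

$$h_T(\theta) = \begin{cases} \theta - T, & T \le \theta \le T + 1 \\ T + 2 - \theta, & T + 1 < \theta \le T + 2 \\ 0, & \text{elsewhere.} \end{cases}$$

Here we have plim $\hat\theta_T$ = plim $(T + 1) = \infty$, although $\lim T^{-1}Q_T = Q$ attains
its unique maximum at $\theta_0$.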


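EXAMPLE 4.1.3. $\Theta = [0, 2]$, $\theta_0 = 1.5$, $T^{-1}Q_T$ continuous but not uniformly
convergent.

$$T^{-1}Q_T(\theta) = \begin{cases} T\theta, & 0 \le \theta \le \dfrac{1}{2T} \\ 1 - T\theta, & \dfrac{1}{2T} < \theta \le \dfrac{1}{T} \\ 0, & \dfrac{1}{T} < \theta \le 1 \\ \dfrac{T}{T+1}(\theta - 1), & 1 < \theta \le \theta_0 \\ \dfrac{T}{T+1}(2 - \theta), & \theta_0 < \theta \le 2. \end{cases}$$

Here we have plim $\hat\theta_T$ = plim $(2T)^{-1} = 0$, although $\lim T^{-1}Q_T$ attains its
unique maximum at $\theta_0$.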
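Example 4.1.3 can be checked numerically with a simple grid search. This is a rough sketch only: the function names and the grid resolution are hypothetical choices, and the grid is built so that the point $1/(2T)$ lies exactly on it.

```python
def q_over_T(theta, T, theta0=1.5):
    """T^{-1} Q_T(theta) of Example 4.1.3 on the interval [0, 2]."""
    if 0 <= theta <= 1 / (2 * T):
        return T * theta
    if theta <= 1 / T:
        return 1 - T * theta
    if theta <= 1:
        return 0.0
    if theta <= theta0:
        return T / (T + 1) * (theta - 1)
    return T / (T + 1) * (2 - theta)


def grid_argmax(T, cells_per_unit=200):
    """Maximize T^{-1} Q_T over an evenly spaced grid on [0, 2] that contains 1/(2T)."""
    n = 2 * cells_per_unit * T                      # number of grid cells on [0, 2]
    grid = [2 * i / n for i in range(n + 1)]
    return max(grid, key=lambda th: q_over_T(th, T))


for T in (1, 5, 50):
    # The maximizer sits at 1/(2T) and drifts to 0, away from theta0 = 1.5.
    print(T, grid_argmax(T), q_over_T(1.5, T))
```

The peak near the origin has height 1/2 for every $T$, whereas the value at $\theta_0$ is $T/[2(T+1)] < 1/2$, which is why the maximizer never moves to $\theta_0$ even though the pointwise limit is maximized there.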


EXAMPLE 4.1.4. $\Theta = [-2, 1]$, $\theta_0 = -1$, $T^{-1}Q_T$ not continuous but uniformly
convergent.

$T^{-1}Q_T(\theta) = (1 - 2^{1-T})\theta + 2 - 2^{2-T}$,   $-2 \le \theta \le -1$

$= -(1 - 2^{1-T})\theta$,   $-1 < \theta \le 0$

SERE)   $0 < \theta \le 1 - 2^{-(T+1)}$

-in 841-27   $1 - 2^{-(T+1)} < \theta < 1$

$= 0$,   $\theta = 1$.


Here we have plim $\hat\theta_T$ = plim $[1 - 2^{-(T+1)}] = 1$, although $\lim T^{-1}Q_T$ attains
its unique maximum at $\theta_0$. (If we change this example so that $\Theta = [-2, 1)$,
$T^{-1}Q_T$ becomes continuous and only the compactness assumption is violated.)


The estimator $\hat\theta_T$ of Theorem 4.1.1 maximizes the function $Q_T(\theta)$ globally.
However, in practice it is often difficult to locate a global maximum of $Q_T(\theta)$,
for it means that we must look through the whole parameter space except in
the fortunate situation where we can prove that $Q_T(\theta)$ is globally concave.

Another weakness of the theorem is that it is often difficult to prove that
$Q(\theta)$ attains its unique global maximum at $\theta_0$. Therefore we would also like to
have a theorem regarding the consistency of a local maximum.

Still another reason for having such a theorem is that we can generally prove 
asymptotic normality only for the local maximum, as we shall show in Section 
4.1.2. Theorem 4.1.2 is such a theorem. 


THEOREM 4.1.2. Make the assumptions:

(A) Let $\Theta$ be an open subset of the Euclidean $K$-space. (Thus the true value
$\theta_0$ is an interior point of $\Theta$.)

(B) $Q_T(y, \theta)$ is a measurable function of $y$ for all $\theta \in \Theta$, and $\partial Q_T/\partial\theta$ exists
and is continuous in an open neighborhood $N_1(\theta_0)$ of $\theta_0$. (Note that this
implies $Q_T$ is continuous for $\theta \in N_1$.)

(C) There exists an open neighborhood $N_2(\theta_0)$ of $\theta_0$ such that $T^{-1}Q_T(\theta)$
converges to a nonstochastic function $Q(\theta)$ in probability uniformly in $\theta$ in
$N_2(\theta_0)$, and $Q(\theta)$ attains a strict local maximum at $\theta_0$.

Let $\Theta_T$ be the set of roots of the equation

$$\frac{\partial Q_T}{\partial\theta} = 0 \tag{4.1.9}$$

corresponding to the local maxima. If that set is empty, set $\Theta_T$ equal to $\{0\}$.

Then, for any $\epsilon > 0$,

$$\lim_{T\to\infty} P\left[\inf_{\theta\in\Theta_T} (\theta - \theta_0)'(\theta - \theta_0) > \epsilon\right] = 0.$$


Proof. Choose a compact set $S \subset N_1 \cap N_2$. Then the value of $\theta$, say $\theta_T^*$, that
globally maximizes $Q_T(\theta)$ in $S$ is consistent by Theorem 4.1.1. But because the
probability that $T^{-1}Q_T(\theta)$ attains a local maximum at $\theta_T^*$ approaches 1 as $T$
goes to $\infty$, $\lim_{T\to\infty} P(\theta_T^* \in \Theta_T) = 1$.


We sometimes state the conclusion of Theorem 4.1.2 simply as "there is a
consistent root of the Eq. (4.1.9)."

The usefulness of Theorem 4.1.2 is limited by the fact that it merely states 
that one of the local maxima is consistent and does not give any guide as to 
how to choose a consistent maximum. There are two ways we can gain some 
degree of confidence that a local maximum is a consistent root: (1) if the 
solution gives a reasonable value from an economic-theoretic viewpoint and 
(2) if the iteration by which the local maximum was obtained started from a 
consistent estimator. We shall discuss the second point more fully in Section 
4.4.2. 


4.1.2 Asymptotic Normality 


In this subsection we shall show that under certain conditions a consistent root 
of Eq. (4.1.9) is asymptotically normal. The precise meaning of this statement 
will be made clear in Theorem 4.1.3. 


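THEOREM 4.1.3. Make the following assumptions in addition to the assumptions
of Theorem 4.1.2:

(A) $\partial^2 Q_T/\partial\theta\,\partial\theta'$ exists and is continuous in an open, convex neighborhood
of $\theta_0$.

(B) $T^{-1}(\partial^2 Q_T/\partial\theta\,\partial\theta')_{\theta_T^*}$ converges to a finite nonsingular matrix
$A(\theta_0) = \lim E\,T^{-1}(\partial^2 Q_T/\partial\theta\,\partial\theta')_{\theta_0}$ in probability for any sequence $\theta_T^*$ such that
plim $\theta_T^* = \theta_0$.

(C) $T^{-1/2}(\partial Q_T/\partial\theta)_{\theta_0} \to N[0, B(\theta_0)]$, where
$B(\theta_0) = \lim E\,T^{-1}(\partial Q_T/\partial\theta)_{\theta_0} \times (\partial Q_T/\partial\theta')_{\theta_0}$.

Let $\{\hat\theta_T\}$ be a sequence obtained by choosing one element from $\Theta_T$ defined in
Theorem 4.1.2 such that plim $\hat\theta_T = \theta_0$. (We call $\hat\theta_T$ a consistent root.) Then

$$\sqrt{T}(\hat\theta_T - \theta_0) \to N[0, A(\theta_0)^{-1}B(\theta_0)A(\theta_0)^{-1}].$$


Proof. By a Taylor expansion we have

$$\left.\frac{\partial Q_T}{\partial\theta}\right|_{\hat\theta_T} = \left.\frac{\partial Q_T}{\partial\theta}\right|_{\theta_0} + \left.\frac{\partial^2 Q_T}{\partial\theta\,\partial\theta'}\right|_{\theta^*}(\hat\theta_T - \theta_0), \tag{4.1.10}$$

where $\theta^*$ lies between $\hat\theta_T$ and $\theta_0$.² Noting that the left-hand side of (4.1.10) is 0
by the definition of $\hat\theta_T$, we obtain

$$\sqrt{T}(\hat\theta_T - \theta_0) = -\left[\frac{1}{T}\left.\frac{\partial^2 Q_T}{\partial\theta\,\partial\theta'}\right|_{\theta^*}\right]^{+} \frac{1}{\sqrt{T}}\left.\frac{\partial Q_T}{\partial\theta}\right|_{\theta_0}, \tag{4.1.11}$$

where $+$ denotes the Moore-Penrose generalized inverse (see note 6 of Chapter
2). Because plim $\theta^* = \theta_0$, assumption B implies

$$\text{plim}\ \frac{1}{T}\left.\frac{\partial^2 Q_T}{\partial\theta\,\partial\theta'}\right|_{\theta^*} = A(\theta_0). \tag{4.1.12}$$

Finally, the conclusion of the theorem follows from assumption C and Eqs.
(4.1.11) and (4.1.12) by repeated applications of Theorem 3.2.7.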
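To make the covariance $A(\theta_0)^{-1}B(\theta_0)A(\theta_0)^{-1}$ concrete, the sketch below computes a sample analogue for a scalar regression through the origin with $Q_T(\theta) = -\frac{1}{2}\sum_t (y_t - \theta x_t)^2$. This setup is an illustrative assumption, not an example from the text: the function name is hypothetical, and estimating $B$ by the average squared score presumes independent observations, so that the cross-product terms of the scores have zero expectation.

```python
def sandwich_variance(x, y):
    """Sample analogue of A^{-1} B A^{-1} for Q_T(theta) = -0.5 * sum (y_t - theta * x_t)^2.

    Returns (theta_hat, estimated asymptotic variance of sqrt(T) * (theta_hat - theta_0)).
    """
    T = len(x)
    # Root of dQ_T/dtheta = sum x_t (y_t - theta x_t) = 0.
    theta_hat = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    # Per-observation scores and second derivatives evaluated at theta_hat.
    scores = [a * (b - theta_hat * a) for a, b in zip(x, y)]
    hessians = [-a * a for a in x]
    A_hat = sum(hessians) / T                 # estimates A(theta_0)
    B_hat = sum(s * s for s in scores) / T    # estimates B(theta_0) under independence
    return theta_hat, B_hat / (A_hat * A_hat)


theta_hat, avar = sandwich_variance([1.0, 2.0], [1.0, 5.0])
print(theta_hat, avar)
```

Dividing the returned variance by $T$ gives an approximate variance for $\hat\theta_T$ itself. In the scalar case the matrix inverses reduce to division by `A_hat`; a matrix version would replace that division with a linear solve.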


As we noted earlier, $Q_T$ is frequently the sum of independent random
variables (or, at least, of random variables with a limited degree of dependence).
Therefore it is not unreasonable to assume the conclusions of a law of
large numbers and a central limit theorem in assumptions B and C, respectively.
However, as we also noted earlier, the following more general normalization
may be necessary in certain cases: In assumption B, change
$T^{-1}\partial^2 Q_T/\partial\theta\,\partial\theta'$ to $H(T)\,\partial^2 Q_T/\partial\theta\,\partial\theta'\,H(T)$, where $H(T)$ is a diagonal matrix
such that $\lim_{T\to\infty} H(T) = 0$; in assumption C, change $T^{-1/2}\partial Q_T/\partial\theta$ to
$H(T)\,\partial Q_T/\partial\theta$; and in the conclusion of the theorem, state the limit distribution
in terms of $H(T)^{-1}(\hat\theta_T - \theta_0)$.

Because assumption B is often not easily verifiable, we shall give two alternative
assumptions, each of which implies assumption B. Let $g_T(\theta) = g(y, \theta)$
be a function of a $T$-vector of random variables $y = (y_1, y_2, \ldots, y_T)'$ and a
continuous function of a $K$-vector of parameters $\theta$ in $\Theta$, an open subset of the
Euclidean $K$-space, almost surely. We assume that $g(y, \theta)$ is a random variable
(that is, $g$ is a measurable function of $y$). We seek conditions that ensure
plim $[g_T(\hat\theta_T) - g_T(\theta_0)] = 0$ whenever plim $\hat\theta_T = \theta_0$. Note that Theorem 3.2.6
does not apply here because in that theorem $g(\cdot)$ is a fixed function not
varying with $T$.


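THEOREM 4.1.4. Assume that $\partial g_T/\partial\theta$ exists for $\theta \in \Theta$, an open convex set,
and that for any $\epsilon > 0$ there exists $M_\epsilon$ such that

$$P(|\partial g_T/\partial\theta_i| < M_\epsilon) \ge 1 - \epsilon$$

for all $T$, for all $\theta \in \Theta$, and for all $i$, where $\theta_i$ is the $i$th element of the vector $\theta$.
Then plim $g_T(\hat\theta_T)$ = plim $g_T(\theta_0)$ if $\theta_0 \equiv$ plim $\hat\theta_T$ is in $\Theta$.


Proof. The proof of Theorem 4.1.4 follows from the Taylor expansion

$$g_T(\hat\theta_T) = g_T(\theta_0) + \left.\frac{\partial g_T}{\partial\theta'}\right|_{\theta^*}(\hat\theta_T - \theta_0),$$

where $\theta^*$ lies between $\hat\theta_T$ and $\theta_0$.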


THEOREM 4.1.5. Suppose $g_T(\theta)$ converges in probability to a nonstochastic
function $g(\theta)$ uniformly in $\theta$ in an open neighborhood $N(\theta_0)$ of $\theta_0$. Then
plim $g_T(\hat\theta_T) = g(\theta_0)$ if plim $\hat\theta_T = \theta_0$ and $g(\theta)$ is continuous at $\theta_0$.


Proof. Because the convergence of $g_T(\theta)$ is uniform in $\theta$ and because $\hat\theta_T \in
N(\theta_0)$ for sufficiently large $T$ with a probability as close to one as desired, we
can show that for any $\epsilon > 0$ and $\delta > 0$, there exists $T_1$ such that for $T > T_1$

$$P\left[|g_T(\hat\theta_T) - g(\hat\theta_T)| \ge \frac{\epsilon}{2}\right] < \frac{\delta}{2}. \tag{4.1.13}$$

Because $g$ is continuous at $\theta_0$ by our assumption, $g(\hat\theta_T)$ converges to $g(\theta_0)$ in
probability by Theorem 3.2.6. Therefore, for any $\epsilon > 0$ and $\delta > 0$, there exists
$T_2$ such that for $T > T_2$

$$P\left[|g(\hat\theta_T) - g(\theta_0)| \ge \frac{\epsilon}{2}\right] < \frac{\delta}{2}. \tag{4.1.14}$$

Therefore, from the inequalities (4.1.13) and (4.1.14) we have for $T >
\max[T_1, T_2]$

$$P[|g_T(\hat\theta_T) - g(\theta_0)| < \epsilon] \ge 1 - \delta. \tag{4.1.15}$$


The inequality (4.1.13) requires that uniform convergence be defined by
either (i) or (ii) of Definition 4.1.1. Definition 4.1.1(iii) is not sufficient, as
shown in the following example attributed to A. Ronald Gallant. Let $\phi_T(\cdot)$ be
a continuous function with support on $[-T^{-1}, T^{-1}]$ and $\phi_T(0) = 1$. Define
$g_T(\omega, \theta) = \phi_T(\omega - T\theta)$ if $0 \le \omega, \theta \le 1$, and $g_T(\omega, \theta) = 0$ otherwise. Assume
that the probability measure over $0 \le \omega \le 1$ is Lebesgue measure and
$\hat\theta_T = \omega/T$. Then

$$\inf_{0\le\theta\le1} P[g_T(\omega, \theta) < \epsilon] \ge 1 - \frac{2}{T} \to 1,$$

meaning that $g_T(\omega, \theta)$ converges to 0 semiuniformly in $\theta$. But $g_T(\omega, \hat\theta_T) = 1$
for all $T$. Note that in this example

$$P\left[\sup_{0\le\theta\le1} g_T(\omega, \theta) < \epsilon\right] = 0.$$

Hence, $g_T(\omega, \theta)$ does not converge uniformly in $\theta$ in the sense of Definition
4.1.1(ii).

The following theorem is useful because it gives the combined conditions
for the consistency and asymptotic normality of a local extremum estimator.
In this way we can do away with the condition that $Q(\theta)$ attains a strict local
maximum at $\theta_0$.


THEOREM 4.1.6. In addition to assumptions A-C of Theorem 4.1.3, assume

(A) $T^{-1}Q_T(\theta)$ converges to a nonstochastic function $Q(\theta)$ in probability
uniformly in $\theta$ in an open neighborhood of $\theta_0$.

(B) $A(\theta_0)$ as defined in assumption B of Theorem 4.1.3 is a negative definite
matrix.

(C) plim $T^{-1}\partial^2 Q_T/\partial\theta\,\partial\theta'$ exists and is continuous in a neighborhood of $\theta_0$.

Then the conclusions of Theorems 4.1.2 and 4.1.3 follow.


Proof. By a Taylor expansion we have in an open neighborhood of $\theta_0$

$$\frac{1}{T}Q_T(\theta) = \frac{1}{T}Q_T(\theta_0) + \frac{1}{T}\left.\frac{\partial Q_T}{\partial\theta'}\right|_{\theta_0}(\theta - \theta_0) + \frac{1}{2}(\theta - \theta_0)'\,\frac{1}{T}\left.\frac{\partial^2 Q_T}{\partial\theta\,\partial\theta'}\right|_{\theta^*}(\theta - \theta_0), \tag{4.1.16}$$

where $\theta^*$ lies between $\theta$ and $\theta_0$. Taking the probability limit of both sides of
(4.1.16) and using assumptions B and C of Theorem 4.1.3 and A of this
theorem, we have

$$Q(\theta) = Q(\theta_0) + \tfrac{1}{2}(\theta - \theta_0)'A^*(\theta - \theta_0), \tag{4.1.17}$$

where

$$A^* = \text{plim}\ \frac{1}{T}\left.\frac{\partial^2 Q_T}{\partial\theta\,\partial\theta'}\right|_{\theta^*}.$$

But $A^*$ is a negative definite matrix because of assumption B of Theorem 4.1.3
and assumptions B and C of this theorem. Therefore

$$Q(\theta) < Q(\theta_0) \quad \text{for } \theta \neq \theta_0. \tag{4.1.18}$$

Thus all the assumptions of Theorem 4.1.2 are satisfied.


4.2 Maximum Likelihood Estimator
4.2.1 Definition


Let $L_T(\theta) = L(y, \theta)$ be the joint density of a $T$-vector of random variables
$y = (y_1, y_2, \ldots, y_T)'$ characterized by a $K$-vector of parameters $\theta$. When we
regard it as a function of $\theta$, we call it the likelihood function. The term
maximum likelihood estimator (MLE) is often used to mean two different
concepts: (1) the value of $\theta$ that globally maximizes the likelihood function
$L(y, \theta)$ over the parameter space $\Theta$; or (2) any root of the likelihood equation

$$\frac{\partial L_T(\theta)}{\partial\theta} = 0 \tag{4.2.1}$$

that corresponds to a local maximum. We use it only in the second sense and
use the term global maximum likelihood estimator to refer to the first concept.
We sometimes use the term local maximum likelihood estimator to refer to
the second concept.


4.2.2 Consistency 


The conditions for the consistency of the global MLE or the local MLE can be
immediately obtained from Theorem 4.1.1 or Theorem 4.1.2 by putting
$Q_T(\theta) = \log L_T(\theta)$. We consider the logarithm of the likelihood function
because $T^{-1}\log L_T(\theta)$ usually converges to a finite constant. Clearly, taking
the logarithm does not change the location of either global or local maxima.

So far we have not made any assumption about the distribution of $y$. If we
assume that $\{y_t\}$ are i.i.d. with common density function $f(\cdot, \theta)$, we can write

$$\log L(y, \theta) = \sum_{t=1}^{T} \log f(y_t, \theta). \tag{4.2.2}$$

In this case we can replace assumption C of either Theorem 4.1.1 or Theorem
4.1.2 by the following two assumptions:

$$E \sup_{\theta\in\Theta} |\log f(y_t, \theta)| < M, \quad \text{for some positive constant } M, \tag{4.2.3}$$

and

$$\log f(y_t, \theta) \text{ is a continuous function of } \theta \text{ for each } y_t. \tag{4.2.4}$$

In Theorem 4.2.1 we shall show that assumptions (4.2.3) and (4.2.4) imply

$$\text{plim}\ \frac{1}{T}\sum_{t=1}^{T} \log f(y_t, \theta) = E \log f(y_t, \theta) \quad \text{uniformly in } \theta \in \Theta. \tag{4.2.5}$$

Furthermore, we have by Jensen's inequality (see Rao, 1973, p. 58, for the
proof)

$$E \log \frac{f(y_t, \theta)}{f(y_t, \theta_0)} < \log E\,\frac{f(y_t, \theta)}{f(y_t, \theta_0)} = 0 \quad \text{for } \theta \neq \theta_0, \tag{4.2.6}$$

where the expectation is taken using the true value $\theta_0$, and, therefore

$$E \log f(y_t, \theta) < E \log f(y_t, \theta_0) \quad \text{for } \theta \neq \theta_0. \tag{4.2.7}$$

As in (4.2.7), we have $T^{-1}E \log L_T(\theta) < T^{-1}E \log L_T(\theta_0)$ for $\theta \neq \theta_0$ and for
all $T$. However, when we take the limit of both sides of the inequality (4.2.7) as
$T$ goes to infinity, we have

$$\lim_{T\to\infty} T^{-1}E \log L_T(\theta) \le \lim_{T\to\infty} T^{-1}E \log L_T(\theta_0).$$

Hence, one generally needs to assume that $\lim_{T\to\infty} T^{-1}E \log L_T(\theta)$ is
uniquely maximized at $\theta = \theta_0$.
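The inequality (4.2.7) and the convergence in (4.2.5) can be illustrated with a unit-variance normal location model, which is an assumption chosen here for convenience and not an example from the text. For $y \sim N(\theta_0, 1)$ and $f(\cdot, \theta)$ the $N(\theta, 1)$ density, $E \log f(y, \theta) = -\frac{1}{2}\log 2\pi - \frac{1}{2}[1 + (\theta - \theta_0)^2]$, which is uniquely maximized at $\theta_0$. The function names, seed, and sample size below are hypothetical.

```python
import math
import random


def expected_loglik(theta, theta0=0.0):
    """E log f(y, theta) when y ~ N(theta0, 1) and f(., theta) is the N(theta, 1) density."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (1.0 + (theta - theta0) ** 2)


def average_loglik(sample, theta):
    """T^{-1} sum_t log f(y_t, theta) for the N(theta, 1) density."""
    total = sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - theta) ** 2 for y in sample)
    return total / len(sample)


rng = random.Random(0)                      # fixed seed so the run is reproducible
sample = [rng.gauss(0.0, 1.0) for _ in range(20000)]
for theta in (-1.0, -0.5, 0.0, 0.5, 1.0):
    # The sample average should track the expectation, which peaks at theta0 = 0.
    print(theta, round(expected_loglik(theta), 4), round(average_loglik(sample, theta), 4))
```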

That (4.2.3) and (4.2.4) imply (4.2.5) follows from the following theorem
when we put $g_t(\theta) = \log f(y_t, \theta) - E \log f(y_t, \theta)$.


THEOREM 4.2.1. Let $g(y, \theta)$ be a measurable function of $y$ in Euclidean
space for each $\theta \in \Theta$, a compact subset of $R^K$ (Euclidean $K$-space), and a
continuous function of $\theta \in \Theta$ for each $y$. Assume $E\,g(y, \theta) = 0$. Let $\{y_t\}$ be a
sequence of i.i.d. random vectors such that $E \sup_{\theta\in\Theta} |g(y_t, \theta)| < \infty$. Then
$T^{-1}\sum_{t=1}^{T} g(y_t, \theta)$ converges to 0 in probability uniformly in $\theta \in \Theta$.


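Proof.³ Partition $\Theta$ into $n$ nonoverlapping regions $\Theta_1^n, \Theta_2^n, \ldots, \Theta_n^n$ in
such a way that the distance between any two points within each $\Theta_i^n$ goes to 0 as
$n$ goes to $\infty$. Let $\theta_1, \theta_2, \ldots, \theta_n$ be an arbitrary sequence of $K$-vectors such
that $\theta_i \in \Theta_i^n$, $i = 1, 2, \ldots, n$. Then writing $g_t(\theta)$ for $g(y_t, \theta)$, we have for any
$\epsilon > 0$

$$\begin{aligned}
P\left[\sup_{\theta\in\Theta}\left|T^{-1}\sum_{t=1}^{T} g_t(\theta)\right| > \epsilon\right]
&\le P\left[\bigcup_{i=1}^{n}\left\{\sup_{\theta\in\Theta_i^n}\left|T^{-1}\sum_{t=1}^{T} g_t(\theta)\right| > \epsilon\right\}\right] \\
&\le \sum_{i=1}^{n} P\left[\sup_{\theta\in\Theta_i^n}\left|T^{-1}\sum_{t=1}^{T} g_t(\theta)\right| > \epsilon\right] \\
&\le \sum_{i=1}^{n} P\left[\left|T^{-1}\sum_{t=1}^{T} g_t(\theta_i)\right| > \frac{\epsilon}{2}\right]
 + \sum_{i=1}^{n} P\left[T^{-1}\sum_{t=1}^{T}\sup_{\theta\in\Theta_i^n}|g_t(\theta) - g_t(\theta_i)| > \frac{\epsilon}{2}\right],
\end{aligned} \tag{4.2.8}$$

where the first inequality follows from the fact that if $A$ implies $B$ then
$P(A) \le P(B)$ and the last inequality follows from the triangle inequality. Because
$g_t(\theta)$ is uniformly continuous in $\theta \in \Theta$, we have for every $i$

$$\lim_{n\to\infty}\sup_{\theta\in\Theta_i^n}|g_t(\theta) - g_t(\theta_i)| = 0. \tag{4.2.9}$$

But, because

$$\sup_{\theta\in\Theta_i^n}|g_t(\theta) - g_t(\theta_i)| \le 2\sup_{\theta\in\Theta}|g_t(\theta)| \tag{4.2.10}$$

and the right-hand side of the inequality (4.2.10) is integrable by our assumptions,
(4.2.9) implies by the Lebesgue convergence theorem (Royden, 1968,
p. 88)

$$\lim_{n\to\infty} E\sup_{\theta\in\Theta_i^n}|g_t(\theta) - g_t(\theta_i)| = 0 \tag{4.2.11}$$

uniformly for $i$. Take $n$ so large that the expected value in (4.2.11) is smaller
than $\epsilon/2$. Finally, the conclusion of the theorem follows from Theorem 3.3.2
(Kolmogorov LLN 2) by taking $T$ to infinity.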


This theorem can be generalized to the extent that $T^{-1}\sum_{t=1}^{T} g_t(\theta_i)$ and
$T^{-1}\sum_{t=1}^{T}\sup_{\theta\in\Theta_i^n}|g_t(\theta) - g_t(\theta_i)|$ can be subjected to a law of large numbers. The
following theorem, which is a special case of a theorem attributed to Hoadley
(1971) (see also White, 1980b), can be proved by making a slight modification
of the proof of Theorem 4.2.1 and using Markov's law of large numbers
(Chapter 3, note 10).


THEOREM 4.2.2. Let $g_t(y, \theta)$ be a measurable function of $y$ in Euclidean
space for each $t$ and for each $\theta \in \Theta$, a compact subset of $R^K$ (Euclidean
$K$-space), and a continuous function of $\theta$ for each $y$ uniformly in $t$. Assume
$E\,g_t(y, \theta) = 0$. Let $\{y_t\}$ be a sequence of independent and not necessarily
identically distributed random vectors such that
$E\sup_{\theta\in\Theta}|g_t(y_t, \theta)|^{1+\delta} \le M < \infty$ for some $\delta > 0$. Then
$T^{-1}\sum_{t=1}^{T} g_t(y_t, \theta)$ converges to 0 in probability uniformly in $\theta \in \Theta$.


We will need a similar theorem (Jennrich, 1969, p. 636) for the case where $y_t$
is a vector of constants rather than of random variables.

THEOREM 4.2.3. Let $y_1, y_2, \ldots, y_T$ be vectors of constants. We define the
empirical distribution function of $(y_1, y_2, \ldots, y_T)$ by
$F_T(\alpha) = T^{-1}\sum_{t=1}^{T}\chi(y_t < \alpha)$, where $\chi$ takes the value 1 or 0 depending on
whether the event in its argument occurs or not. Note that $y_t < \alpha$ means every
element of the vector $y_t$ is smaller than the corresponding element of $\alpha$. Assume
that $g(y, \theta)$ is a bounded and continuous function of $y$ in Euclidean space and
$\theta$ in a compact set $\Theta$. Also assume that $F_T$ converges to a distribution function
$F$. Then $\lim_{T\to\infty} T^{-1}\sum_{t=1}^{T} g(y_t, \theta) = \int g(y, \theta)\,dF(y)$ uniformly in $\theta$.


There are many results in the literature concerning the consistency of the 
maximum likelihood estimator in the i.i.d. case. Rao (1973, p. 364) has 
presented the consistency of the local MLE, which was originally proved by 
Cramér (1946). Wald (1949) proved the consistency of the global MLE with- 
out assuming compactness of the parameter space, but his conditions are 
difficult to verify in practice. Many other references concerning the asymp- 
totic properties of the MLE can be found in survey articles by Norden (1972, 
1973). 

As an example of the application of Theorem 4.1.1, we shall prove the
consistency of the maximum likelihood estimators of $\beta$ and $\sigma^2$ in Model 1.
Because in this case the maximum likelihood estimators can be written as 
explicit functions of the sample, it is easier to prove consistency by a direct 
method, as we have already done in Section 3.5. We are considering this more 
complicated proof strictly for the purpose of illustration. 

EXAMPLE 4.2.1. Prove the consistency of the maximum likelihood estimators
of the parameters of Model 1 with normality using Theorem 4.1.1, assuming
that $\lim T^{-1}X'X$ is a finite positive definite matrix.

In Section 1.1 we used the symbols $\beta$ and $\sigma^2$ to denote the true values
because we did not need to distinguish them from the domain of the likelihood
function. But now we shall put the subscript 0 to denote the true value;
therefore we can write Eq. (1.1.4) as

$$y = X\beta_0 + u, \tag{4.2.12}$$

where $Vu_t = \sigma_0^2$. From (1.3.1) we have

$$\begin{aligned}
\log L_T &= -\frac{T}{2}\log 2\pi - \frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta) \\
&= -\frac{T}{2}\log 2\pi - \frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}[X(\beta_0 - \beta) + u]'[X(\beta_0 - \beta) + u],
\end{aligned} \tag{4.2.13}$$

where the second equality is obtained by using (4.2.12). Therefore

$$\text{plim}\ \frac{1}{T}\log L_T = -\frac{1}{2}\log 2\pi - \frac{1}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(\beta_0 - \beta)'\lim\frac{X'X}{T}(\beta_0 - \beta) - \frac{\sigma_0^2}{2\sigma^2}. \tag{4.2.14}$$

Define a compact parameter space $\Theta$ by

$$c_1 \le \sigma^2 \le c_2, \quad \beta'\beta \le c_3, \tag{4.2.15}$$

where $c_1$ is a small positive constant and $c_2$ and $c_3$ are large positive constants,
and assume that $(\beta_0', \sigma_0^2)$ is an interior point of $\Theta$. Then, clearly, the convergence
in (4.2.14) is uniform in $\Theta$ and the right-hand side of (4.2.14) is uniquely
maximized at $(\beta_0', \sigma_0^2)$. Put $\theta = (\beta', \sigma^2)'$ and define $\hat\theta_T$ by

$$\log L_T(\hat\theta_T) = \max_{\theta\in\Theta}\log L_T(\theta). \tag{4.2.16}$$

Then $\hat\theta_T$ is clearly consistent by Theorem 4.1.1. Now define $\tilde\theta_T$ by

$$\log L_T(\tilde\theta_T) = \max\log L_T(\theta), \tag{4.2.17}$$

where the maximization in (4.2.17) is over the whole Euclidean $(K + 1)$-space.
Then the consistency of $\tilde\theta_T$, which we set out to prove, would follow from

$$\lim_{T\to\infty} P(\hat\theta_T = \tilde\theta_T) = 1. \tag{4.2.18}$$

The proof of (4.2.18) would be simple if we used our knowledge of the explicit
formulae for $\tilde\theta_T$ in this example. But that would be cheating. The proof of
(4.2.18) using condition D given after the proof of Theorem 4.1.1 is left as an
exercise.
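The consistency claimed in Example 4.2.1 can also be seen in a small simulation. The sketch below is an illustration under stated assumptions rather than a proof: it takes $K = 1$, uses the well-known closed-form maximizers of (4.2.13), and picks hypothetical true values ($\beta_0 = 1.5$, $\sigma_0^2 = 2$), regressors, seed, and function names.

```python
import math
import random


def simulate(T, beta0=1.5, sigma0_sq=2.0, seed=0):
    """Draw y_t = beta0 * x_t + u_t with u_t ~ N(0, sigma0_sq) and fixed regressors x_t."""
    rng = random.Random(seed)
    x = [1.0 + (t % 3) for t in range(T)]           # nonstochastic; T^{-1} x'x tends to 14/3
    y = [beta0 * xt + rng.gauss(0.0, math.sqrt(sigma0_sq)) for xt in x]
    return x, y


def mle(x, y):
    """Closed-form maximizers of the normal log likelihood (4.2.13) when K = 1."""
    beta_hat = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    sigma_sq_hat = sum((b - beta_hat * a) ** 2 for a, b in zip(x, y)) / len(x)
    return beta_hat, sigma_sq_hat


def log_lik(beta, sigma_sq, x, y):
    """Log likelihood (4.2.13) evaluated at (beta, sigma_sq)."""
    T = len(x)
    ssr = sum((b - beta * a) ** 2 for a, b in zip(x, y))
    return -0.5 * T * math.log(2 * math.pi) - 0.5 * T * math.log(sigma_sq) - ssr / (2 * sigma_sq)


for T in (50, 500, 5000, 50000):
    x, y = simulate(T)
    # Both estimates should settle near the true values (1.5, 2.0) as T grows.
    print(T, mle(x, y))
```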


There are cases where the global maximum likelihood estimator is incon- 
sistent, whereas a root of the likelihood equation (4.2.1) can be consistent, as 
in the following example. 


EXAMPLE 4.2.2. Let $y_t$, $t = 1, 2, \ldots, T$, be independent with the common
distribution defined by

$$f(y_t) = \begin{cases} N(\mu_1, \sigma_1^2) & \text{with probability } \lambda \\ N(\mu_2, \sigma_2^2) & \text{with probability } 1 - \lambda. \end{cases} \tag{4.2.19}$$

This distribution is called a mixture of normal distributions. The likelihood
function is given by

$$L = \prod_{t=1}^{T}\left\{\frac{\lambda}{\sqrt{2\pi}\,\sigma_1}\exp[-(y_t - \mu_1)^2/(2\sigma_1^2)] + \frac{1 - \lambda}{\sqrt{2\pi}\,\sigma_2}\exp[-(y_t - \mu_2)^2/(2\sigma_2^2)]\right\}. \tag{4.2.20}$$

If we put $\mu_1 = y_1$ and let $\sigma_1$ approach 0, the term of the product that corresponds
to $t = 1$ goes to infinity, and, consequently, $L$ goes to infinity. Hence,
the global MLE cannot be consistent. Note that this example violates assumption
C of Theorem 4.1.1 because $Q(\theta)$ does not attain a global maximum at $\theta_0$.
However, the conditions of Theorem 4.1.2 are generally satisfied by this
model. An extension of this model to the regression case is called the switching
regression model (see Quandt and Ramsey, 1978).
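The unboundedness of the mixture likelihood is easy to see numerically. The sketch below evaluates the logarithm of (4.2.20) with $\mu_1 = y_1$ and a shrinking $\sigma_1$; the data values, the remaining parameter values, and the function names are hypothetical.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y."""
    return math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def mixture_log_lik(y, lam, mu1, sigma1, mu2, sigma2):
    """Log of (4.2.20): sum over t of log[lam * N(mu1, sigma1^2) + (1 - lam) * N(mu2, sigma2^2)]."""
    return sum(
        math.log(lam * normal_pdf(yt, mu1, sigma1) + (1 - lam) * normal_pdf(yt, mu2, sigma2))
        for yt in y
    )


y = [0.3, -1.2, 0.8, 2.1, -0.5]                     # hypothetical data
for sigma1 in (1.0, 1e-2, 1e-4, 1e-8):
    # Centering the first component on y[0] and shrinking sigma1 inflates the t = 1 term.
    print(sigma1, mixture_log_lik(y, 0.5, y[0], sigma1, 0.0, 1.0))
```

For small $\sigma_1$ the first component contributes essentially nothing at the other observations, so the log likelihood grows like $-\log\sigma_1$ and has no finite supremum.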


It is hard to construct examples in which the maximum likelihood estima- 
tor (assuming the likelihood function is correctly specified) is not consistent 
and another estimator is. Neyman and Scott (1948) have presented an inter- 
esting example of this type. In their example MLE is not consistent because 
the number of incidental (or nuisance) parameters goes to infinity as the 
sample size goes to infinity. 


4.2.3 Asymptotic Normality 


The asymptotic normality of the maximum likelihood estimator or, more
precisely, a consistent root of the likelihood equation (4.2.1), can be analyzed
by putting $Q_T = \log L_T$ in Theorem 4.1.3. If $\{y_t\}$ are independent, we can
write

$$\log L_T = \sum_{t=1}^{T}\log f_t(y_t, \theta), \tag{4.2.21}$$

where $f_t$ is the marginal density of $y_t$. Thus, under general conditions on $f_t$, we
can apply a law of large numbers to $\partial^2\log L_T/\partial\theta\,\partial\theta'$ and a central limit
theorem to $\partial\log L_T/\partial\theta$. Even if $\{y_t\}$ are not independent, a law of large
numbers and a central limit theorem may still be applicable as long as the
degree of dependence is limited in a certain manner, as we shall show in later
chapters. Thus we see that assumptions B and C of Theorem 4.1.3 are expected
to hold generally in the case of the maximum likelihood estimator.
Moreover, when we use the characteristics of $L_T$ as a joint density function,
we can get more specific results than Theorem 4.1.3, namely as we have shown
in Section 1.3.2, the regularity conditions on the likelihood function given in
assumptions A' and B' of Section 1.3.2 imply

$$A(\theta_0) = -B(\theta_0). \tag{4.2.22}$$

Therefore, we shall make (4.2.22) an additional assumption and state it formally
as a theorem.


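THEOREM 4.2.4. Under the assumptions of Theorem 4.1.3 and assumption
(4.2.22), the maximum likelihood estimator $\hat\theta_T$ satisfies

$$\sqrt{T}(\hat\theta_T - \theta_0) \to N\left\{0, -\left[\lim E\,\frac{1}{T}\left.\frac{\partial^2\log L_T}{\partial\theta\,\partial\theta'}\right|_{\theta_0}\right]^{-1}\right\}. \tag{4.2.23}$$

If $\{y_t\}$ are i.i.d. with the common density function $f(\cdot, \theta)$, we can replace
assumptions B and C of Theorem 4.1.3 as well as the additional assumption
(4.2.22) with the following conditions on $f(\cdot, \theta)$ itself:

$$\int\frac{\partial f}{\partial\theta}\,dy = 0, \tag{4.2.24}$$

$$\int\frac{\partial^2 f}{\partial\theta\,\partial\theta'}\,dy = 0, \tag{4.2.25}$$

$$\text{plim}\ \frac{1}{T}\sum_{t=1}^{T}\frac{\partial^2\log f_t}{\partial\theta\,\partial\theta'} = E\,\frac{\partial^2\log f_t}{\partial\theta\,\partial\theta'} \quad \text{uniformly in } \theta \text{ in an open neighborhood of } \theta_0. \tag{4.2.26}$$

A sufficient set of conditions for (4.2.26) can be found by putting $g_t(\theta) =
\partial^2\log f_t/\partial\theta_i\,\partial\theta_j$ in Theorem 4.2.1. Because $\log L_T = \sum_{t=1}^{T}\log f(y_t, \theta)$ in this
case, (4.2.26) implies assumption B of Theorem 4.1.3 because of Theorem
4.1.5. Assumption C of Theorem 4.1.3 follows from (4.2.24) and (4.2.26) on
account of Theorem 3.3.4 (Lindeberg-Lévy CLT) since (4.2.24) implies
$E(\partial\log f/\partial\theta)_{\theta_0} = 0$. Finally, it is easy to show that assumptions (4.2.24)-
(4.2.26) imply (4.2.22).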

We shall use the same model as that used in Example 4.2.1 and shall 
illustrate how the assumptions of Theorem 4.1.3 and the additional assump- 
tion (4.2.22) are satisfied. As for Example 4.2.1, the sole purpose of Example 
4.2.3 is as an illustration, as the same results have already been obtained by a 
direct method in Chapter 1. 


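EXAMPLE 4.2.3. Under the same assumptions made in Example 4.2.1,
prove the asymptotic normality of the maximum likelihood estimator
$\hat\theta = (\hat\beta', \hat\sigma^2)'$.

We first obtain the first and second derivatives of log $L$:

$$\frac{\partial\log L}{\partial\beta} = -\frac{1}{\sigma^2}(X'X\beta - X'y), \tag{4.2.27}$$

$$\frac{\partial\log L}{\partial\sigma^2} = -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta), \tag{4.2.28}$$

$$\frac{\partial^2\log L}{\partial\beta\,\partial\beta'} = -\frac{1}{\sigma^2}X'X, \tag{4.2.29}$$

$$\frac{\partial^2\log L}{\partial(\sigma^2)^2} = \frac{T}{2\sigma^4} - \frac{1}{\sigma^6}(y - X\beta)'(y - X\beta), \tag{4.2.30}$$

$$\frac{\partial^2\log L}{\partial\sigma^2\,\partial\beta} = \frac{1}{\sigma^4}(X'X\beta - X'y). \tag{4.2.31}$$

From (4.2.29), (4.2.30), and (4.2.31) we can clearly see that assumptions A
and B of Theorem 4.1.3 are satisfied. Also from these equations we can
evaluate the elements of $A(\theta_0)$:

$$\text{plim}\ \frac{1}{T}\left.\frac{\partial^2\log L}{\partial\beta\,\partial\beta'}\right|_{\theta_0} = -\frac{1}{\sigma_0^2}\lim\frac{X'X}{T}, \tag{4.2.32}$$

$$\text{plim}\ \frac{1}{T}\left.\frac{\partial^2\log L}{\partial(\sigma^2)^2}\right|_{\theta_0} = -\frac{1}{2\sigma_0^4}, \tag{4.2.33}$$

$$\text{plim}\ \frac{1}{T}\left.\frac{\partial^2\log L}{\partial\sigma^2\,\partial\beta}\right|_{\theta_0} = 0. \tag{4.2.34}$$

From (4.2.27) and (4.2.28) we obtain

$$\frac{1}{\sqrt{T}}\left.\frac{\partial\log L}{\partial\beta}\right|_{\theta_0} = \frac{1}{\sigma_0^2}\frac{X'u}{\sqrt{T}} \quad \text{and} \quad \frac{1}{\sqrt{T}}\left.\frac{\partial\log L}{\partial\sigma^2}\right|_{\theta_0} = \frac{1}{2\sigma_0^4}\frac{u'u - T\sigma_0^2}{\sqrt{T}}.$$

Thus, by applying either the Lindeberg-Feller or Liapounov CLT to a sequence
of an arbitrary linear combination of a $(K + 1)$-vector
$(x_{1t}u_t, x_{2t}u_t, \ldots, x_{Kt}u_t, u_t^2 - \sigma_0^2)$, we can show

$$\frac{1}{\sqrt{T}}\left.\frac{\partial\log L}{\partial\beta}\right|_{\theta_0} \to N\left(0, \frac{1}{\sigma_0^2}\lim\frac{X'X}{T}\right) \tag{4.2.35}$$

and

$$\frac{1}{\sqrt{T}}\left.\frac{\partial\log L}{\partial\sigma^2}\right|_{\theta_0} \to N\left(0, \frac{1}{2\sigma_0^4}\right) \tag{4.2.36}$$

with zero asymptotic covariance between (4.2.35) and (4.2.36). Thus assumption
C of Theorem 4.1.3 has been shown to hold. Finally, results (4.2.32)
through (4.2.36) show that assumption (4.2.22) is satisfied. We write the
conclusion (4.2.23) specifically for the present example as

$$\sqrt{T}\begin{bmatrix}\hat\beta - \beta_0 \\ \hat\sigma^2 - \sigma_0^2\end{bmatrix} \to N\left\{0, \begin{bmatrix}\sigma_0^2(\lim T^{-1}X'X)^{-1} & 0 \\ 0 & 2\sigma_0^4\end{bmatrix}\right\}. \tag{4.2.37}$$

There are cases where the global maximum likelihood estimator exists but
does not satisfy the likelihood equation (4.2.1). Then Theorem 4.1.3 cannot
be used to prove the asymptotic normality of MLE. The model of Aigner,
Amemiya, and Poirier (1976) is such an example. In their model, plim $T^{-1}\log L_T$
exists and is maximized at the true parameter value $\theta_0$ so that MLE is
consistent. However, problems arise because plim $T^{-1}\log L_T$ is not smooth at
$\theta_0$; it looks like Figure 4.1. In such a case, it is generally difficult to prove
asymptotic normality.

Figure 4.1 The log likelihood function in a nonregular case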


4.2.4 Asymptotic Efficiency 


The asymptotic normality (4.2.23) means that if $T$ is large the
variance-covariance matrix of a maximum likelihood estimator may be
approximated by

$$-\left[E \left.\frac{\partial^2 \log L_T}{\partial\theta\,\partial\theta'}\right|_{\theta_0}\right]^{-1}. \tag{4.2.38}$$

But (4.2.38) is precisely the Cramér-Rao lower bound of an unbiased estimator
derived in Section 1.3.2. At one time statisticians believed that a consistent
and asymptotically normal estimator with the asymptotic covariance matrix
(4.2.38) was asymptotically minimum variance among all consistent and
asymptotically normal estimators. But this was proved wrong by the following
counterexample, attributed to Hodges and reported in LeCam (1953).


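EXAMPLE 4.2.4. Let $\hat\theta_T$ be an estimator of a scalar parameter $\theta$
such that plim $\hat\theta_T = \theta$ and
$\sqrt{T}(\hat\theta_T - \theta) \to N[0, v(\theta)]$. Define the estimator
$\theta_T^* = w_T\hat\theta_T$, where

$$w_T = \begin{cases} 0 & \text{if } |\hat\theta_T| < T^{-1/4} \\ 1 & \text{if } |\hat\theta_T| \ge T^{-1/4}. \end{cases}$$

It can be shown (the proof is left as an exercise) that
$\sqrt{T}(\theta_T^* - \theta) \to N[0, v^*(\theta)]$, where $v^*(0) = 0$ and
$v^*(\theta) = v(\theta)$ if $\theta \ne 0$.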
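
The superefficiency in Example 4.2.4 can be seen numerically. The following
Python sketch is an illustration only and is not part of the original example:
it assumes $y_t \sim N(\theta, 1)$, takes $\hat\theta_T$ to be the sample mean
(so that $v(\theta) = 1$), and uses the hypothetical helper names `hodges` and
`scaled_mse`. It estimates $T\,E(\theta_T^* - \theta)^2$ by Monte Carlo at
$\theta = 0$ and at $\theta = 1$.

```python
import math
import random


def hodges(theta_hat, T):
    """Hodges' estimator: set the estimate to 0 when |theta_hat| < T**(-1/4)."""
    return 0.0 if abs(theta_hat) < T ** (-0.25) else theta_hat


def scaled_mse(theta, T, reps, rng):
    """Monte Carlo estimate of T * E[(estimator - theta)^2] for the sample
    mean and for Hodges' modification, assuming y_t ~ N(theta, 1)."""
    mse_mean = 0.0
    mse_hodges = 0.0
    for _ in range(reps):
        # The mean of T independent N(theta, 1) draws is exactly N(theta, 1/T),
        # so we draw the sample mean directly (an illustrative shortcut).
        theta_hat = rng.gauss(theta, 1.0 / math.sqrt(T))
        mse_mean += (theta_hat - theta) ** 2
        mse_hodges += (hodges(theta_hat, T) - theta) ** 2
    return T * mse_mean / reps, T * mse_hodges / reps


rng = random.Random(0)
at_zero = scaled_mse(0.0, 10_000, 2_000, rng)  # (sample mean, Hodges) at theta = 0
at_one = scaled_mse(1.0, 10_000, 2_000, rng)   # (sample mean, Hodges) at theta = 1
```

Under these assumptions the scaled mean squared error of $\theta_T^*$ should be
essentially zero at $\theta = 0$ and should coincide with that of the sample
mean at $\theta = 1$, in line with $v^*(0) = 0$ and $v^*(\theta) = v(\theta)$
for $\theta \ne 0$.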

The estimator $\theta_T^*$ of Example 4.2.4 is said to be superefficient.
Despite the existence of superefficient estimators, we can still say something
good about an estimator with the asymptotic variance-covariance matrix
(4.2.38). We shall state two such results without proof. One is the result of
LeCam (1953), which states that the set of $\theta$ points on which a
superefficient estimator has an asymptotic variance smaller than the
Cramér-Rao lower bound is of Lebesgue measure zero. The other is the result of
Rao (1973, p. 350) that the matrix (4.2.38) is the lower bound for the
asymptotic variance-covariance matrix of all the consistent and asymptotically
normal estimators for which the convergence to a normal distribution is
uniform over compact intervals of $\theta$. These results seem to justify our
use of the term asymptotically efficient in the following sense:


DEFINITION 4.2.1. A consistent estimator is said to be asymptotically
efficient if it satisfies statement (4.2.23).


Thus the maximum likelihood estimator under the appropriate assumptions is
asymptotically efficient by definition. An asymptotically efficient
estimator is also referred to as best asymptotically normal (BAN for short).
There are BAN estimators other than MLE. Many examples of these will be
discussed in subsequent chapters: for example, the weighted least squares
estimator in Section 6.5.3, the two-stage and three-stage least squares
estimators in Sections 7.3.3 and 7.4, and the minimum chi-square estimators
in Section 9.2.5. Barankin and Gurland (1951) have presented a general method
of generating BAN estimators. Because their results are mathematically too
abstract to present here, we shall state only a simple corollary of their
results: Let $\{y_t\}$ be an i.i.d. sequence of random vectors with
$E y_t = \mu(\theta)$, $E(y_t - \mu)(y_t - \mu)' = \Sigma(\theta)$, and with
exponential family density

$$f(y, \theta) = \exp\left[\alpha_0(\theta) + \beta_0(y) + \sum_{i=1}^{K} \alpha_i(\theta)\beta_i(y)\right]$$

and define $z_T = T^{-1}\sum_{t=1}^T y_t$. Then the minimization of
$[z_T - \mu(\theta)]'\Sigma(\theta)^{-1}[z_T - \mu(\theta)]$ yields a BAN
estimator of $\theta$ (see also Taylor, 1953; Ferguson, 1958).

Different BAN estimators may have different finite sample distributions. 
Recently many interesting articles in the statistical literature have compared 
the approximate distribution (with a higher degree of accuracy than the 
asymptotic distribution) of MLE with those of other BAN estimators. For 
example, Ghosh and Subramanyam (1974) have shown that in the exponential
family the mean squared error up to $O(T^{-2})$ of MLE after correcting its
bias up to $O(T^{-1})$ is smaller than that of any other BAN estimator with
similarly corrected bias. This result is referred to as the second-order efficiency 
of MLE,’ and examples of it will be given in Sections 7.3.5 and 9.2.6. 


4.2.5 Concentrated Likelihood Function 


We often encounter in practice the situation where the parameter vector
$\theta_0$ can be naturally partitioned into two subvectors $\alpha_0$ and
$\beta_0$ as $\theta_0 = (\alpha_0', \beta_0')'$. The regression model is one
such example; in this model the parameters consist of the regression
coefficients and the error variance. First, partition the maximum likelihood
estimator as $\hat\theta = (\hat\alpha', \hat\beta')'$. Then, the limit
distribution of $\sqrt{T}(\hat\alpha - \alpha_0)$ can easily be derived from
statement (4.2.23). Partition the inverse of the asymptotic
variance-covariance matrix of statement (4.2.23) conformably with the
partition of the parameter vector as

$$-\lim E\,\frac{1}{T} \left.\frac{\partial^2 \log L_T}{\partial\theta\,\partial\theta'}\right|_{\theta_0} = \begin{bmatrix} A & B \\ B' & C \end{bmatrix}. \tag{4.2.39}$$

Then, by Theorem 13 of Appendix 1, we have

$$\sqrt{T}(\hat\alpha - \alpha_0) \to N[0, (A - BC^{-1}B')^{-1}]. \tag{4.2.40}$$


Let the likelihood function be $L(\alpha, \beta)$. Sometimes it is easier to
maximize $L$ in two steps (first, maximize it with respect to $\beta$ and
insert the maximizing value of $\beta$ back into $L$; second, maximize $L$
with respect to $\alpha$) than to maximize $L$ simultaneously for $\alpha$ and
$\beta$. More precisely, define

$$L^*(\alpha) = L[\alpha, \hat\beta(\alpha)], \tag{4.2.41}$$

where $\hat\beta(\alpha)$ is defined as a root of

$$\left.\frac{\partial \log L}{\partial\beta}\right|_{\hat\beta} = 0, \tag{4.2.42}$$

and define $\hat\alpha$ as a root of

$$\left.\frac{\partial \log L^*}{\partial\alpha}\right|_{\hat\alpha} = 0. \tag{4.2.43}$$

We call $L^*(\alpha)$ the concentrated likelihood function of $\alpha$. In this
subsection we shall pose and answer affirmatively the question: If we treat
$L^*(\alpha)$ as if it were a proper likelihood function and obtain the limit
distribution by (4.2.23), do we get the same result as (4.2.40)?
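
The two-step maximization in (4.2.41)-(4.2.43) can be sketched in Python. The
example below is an assumption made for illustration, not a model taken from
the text: $y_t \sim N(\alpha, \beta)$, so that $\alpha$ is the mean and $\beta$
the variance, and the inner maximization over $\beta$ has a closed form. The
function names and the golden-section search are hypothetical choices.

```python
import math


def log_lik(alpha, beta, y):
    """Normal log likelihood with mean alpha and variance beta."""
    T = len(y)
    ss = sum((v - alpha) ** 2 for v in y)
    return -0.5 * T * math.log(2 * math.pi * beta) - ss / (2 * beta)


def beta_hat(alpha, y):
    """Root of d log L / d beta = 0 for a given alpha, as in (4.2.42)."""
    return sum((v - alpha) ** 2 for v in y) / len(y)


def concentrated_log_lik(alpha, y):
    """log L*(alpha) = log L[alpha, beta_hat(alpha)], as in (4.2.41)."""
    return log_lik(alpha, beta_hat(alpha, y), y)


def golden_max(func, lo, hi, tol=1e-8):
    """Golden-section search for the maximizer of a unimodal function."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if func(c) > func(d):
            b = d  # the maximizer lies in [a, d]
        else:
            a = c  # the maximizer lies in [c, b]
    return (a + b) / 2


y = [2.1, 3.4, 1.9, 2.8, 3.0, 2.5]
# Second step: maximize the concentrated likelihood over alpha, as in (4.2.43).
alpha_hat = golden_max(lambda a: concentrated_log_lik(a, y), 0.0, 5.0)
```

In this assumed model the two-step estimate should agree with the joint
maximum likelihood estimates, namely the sample mean for $\alpha$ and the
average squared deviation from it for $\beta$.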

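From Theorem 4.1.3 and its proof we have

$$\sqrt{T}(\hat\alpha - \alpha_0) \overset{LD}{=} -\left[\operatorname{plim} \frac{1}{T} \left.\frac{\partial^2 \log L^*}{\partial\alpha\,\partial\alpha'}\right|_{\alpha_0}\right]^{-1} \frac{1}{\sqrt{T}} \left.\frac{\partial \log L^*}{\partial\alpha}\right|_{\alpha_0}, \tag{4.2.44}$$

where $\overset{LD}{=}$ means that both sides of the equation have the same
limit distribution. Differentiating both sides of (4.2.41) and evaluating the
derivative at $\alpha_0$ yields

$$\left.\frac{\partial \log L^*}{\partial\alpha}\right|_{\alpha_0} = \left.\frac{\partial \log L}{\partial\alpha}\right|_{\alpha_0, \hat\beta(\alpha_0)} + \left.\frac{\partial\hat\beta'}{\partial\alpha}\right|_{\alpha_0} \left.\frac{\partial \log L}{\partial\beta}\right|_{\alpha_0, \hat\beta(\alpha_0)} = \left.\frac{\partial \log L}{\partial\alpha}\right|_{\alpha_0, \hat\beta(\alpha_0)}, \tag{4.2.45}$$

where the second equality follows from (4.2.42). By a Taylor expansion we
have

$$\left.\frac{\partial \log L}{\partial\alpha}\right|_{\alpha_0, \hat\beta(\alpha_0)} = \left.\frac{\partial \log L}{\partial\alpha}\right|_{\theta_0} + \left.\frac{\partial^2 \log L}{\partial\alpha\,\partial\beta'}\right|_{\theta^*} [\hat\beta(\alpha_0) - \beta_0], \tag{4.2.46}$$

where $\theta^*$ lies between $[\alpha_0', \hat\beta(\alpha_0)']'$ and
$\theta_0$. But we have

$$\sqrt{T}[\hat\beta(\alpha_0) - \beta_0] \overset{LD}{=} -\left[\lim E\,\frac{1}{T} \left.\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'}\right|_{\theta_0}\right]^{-1} \frac{1}{\sqrt{T}} \left.\frac{\partial \log L}{\partial\beta}\right|_{\theta_0}. \tag{4.2.47}$$

Therefore, from (4.2.45), (4.2.46), and (4.2.47), we obtain

$$\frac{1}{\sqrt{T}} \left.\frac{\partial \log L^*}{\partial\alpha}\right|_{\alpha_0} \overset{LD}{=} (I, -BC^{-1})\,\frac{1}{\sqrt{T}} \left.\frac{\partial \log L}{\partial\theta}\right|_{\theta_0}. \tag{4.2.48}$$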
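Finally, using (4.2.22), we obtain from (4.2.48)

$$\frac{1}{\sqrt{T}} \left.\frac{\partial \log L^*}{\partial\alpha}\right|_{\alpha_0} \to N(0, A - BC^{-1}B'). \tag{4.2.49}$$

Next, differentiating both sides of the identity

$$\frac{\partial \log L^*(\alpha)}{\partial\alpha} = \frac{\partial \log L[\alpha, \hat\beta(\alpha)]}{\partial\alpha} \tag{4.2.50}$$

with respect to $\alpha$ yields

$$\left.\frac{\partial^2 \log L^*}{\partial\alpha\,\partial\alpha'}\right|_{\alpha_0} = \left.\frac{\partial^2 \log L}{\partial\alpha\,\partial\alpha'}\right|_{\alpha_0, \hat\beta(\alpha_0)} + \left.\frac{\partial\hat\beta'}{\partial\alpha}\right|_{\alpha_0} \left.\frac{\partial^2 \log L}{\partial\beta\,\partial\alpha'}\right|_{\alpha_0, \hat\beta(\alpha_0)}. \tag{4.2.51}$$

Differentiating both sides of (4.2.42) with respect to $\alpha$ yields

$$\left.\frac{\partial^2 \log L}{\partial\alpha\,\partial\beta'}\right|_{\alpha_0, \hat\beta(\alpha_0)} + \left.\frac{\partial\hat\beta'}{\partial\alpha}\right|_{\alpha_0} \left.\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'}\right|_{\alpha_0, \hat\beta(\alpha_0)} = 0. \tag{4.2.52}$$

Combining (4.2.51) and (4.2.52) yields

$$\left.\frac{\partial^2 \log L^*}{\partial\alpha\,\partial\alpha'}\right|_{\alpha_0} = \left.\frac{\partial^2 \log L}{\partial\alpha\,\partial\alpha'}\right|_{\alpha_0, \hat\beta(\alpha_0)} - \left.\frac{\partial^2 \log L}{\partial\alpha\,\partial\beta'}\right|_{\alpha_0, \hat\beta(\alpha_0)} \left[\left.\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'}\right|_{\alpha_0, \hat\beta(\alpha_0)}\right]^{-1} \left.\frac{\partial^2 \log L}{\partial\beta\,\partial\alpha'}\right|_{\alpha_0, \hat\beta(\alpha_0)}, \tag{4.2.53}$$

which implies

$$\operatorname{plim} \frac{1}{T} \left.\frac{\partial^2 \log L^*}{\partial\alpha\,\partial\alpha'}\right|_{\alpha_0} = -(A - BC^{-1}B'). \tag{4.2.54}$$

Finally, we have proved that (4.2.44), (4.2.49), and (4.2.54) lead precisely to
the conclusion (4.2.40) as desired.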


4.3 Nonlinear Least Squares Estimator 
4.3.1 Definition 


We shall first present the nonlinear regression model, which is a nonlinear 
generalization of Model 1 of Chapter 1. The assumptions we shall make are 
also similar to those of Model 1. As in Chapter 1 we first shall state only the 
fundamental assumptions and later shall add a few more assumptions as 
needed for obtaining particular results. 

We assume

$$y_t = f_t(\beta_0) + u_t, \quad t = 1, 2, \ldots, T, \tag{4.3.1}$$

where $y_t$ is a scalar observable random variable, $\beta_0$ is a $K$-vector
of unknown parameters, and $\{u_t\}$ are i.i.d. unobservable random variables
such that $Eu_t = 0$ and $Vu_t = \sigma_0^2$ (another unknown parameter) for
all $t$.

The assumptions on the function $f_t$ will be specified later. Often in
practice we can write $f_t(\beta_0) = f(x_t, \beta_0)$, where $x_t$ is a vector
of exogenous variables (known constants), which, unlike the linear regression
model, may not necessarily be of the same dimension as $\beta_0$.

As in Chapter 1, we sometimes write (4.3.1) in vector form as

$$y = f(\beta_0) + u, \tag{4.3.2}$$

where $y$, $f$, and $u$ are all $T$-vectors, for which the $t$th element is
defined in (4.3.1).

Nonlinearity arises in many diverse ways in econometric applications. For 
example, it arises when the observed variables in a linear regression model are 
transformed to take account of serial correlation of the error terms (cf. Section 
6.3). Another example is the distributed-lag model (see Section 5.6), in which 
the coefficients on the lagged exogenous variables are specified to decrease 
with lags in a certain nonlinear fashion. In both of these examples, nonlinear- 
ity exists only in parameters and not in variables. 

More general nonlinear models, in which nonlinearity is present both in 
parameters and variables, are used in the estimation of production functions 
and demand functions. The Cobb-Douglas production function with an addi- 
tive error term is given by 


$$Q_t = \beta_1 K_t^{\beta_2} L_t^{\beta_3} + u_t, \tag{4.3.3}$$


where Q, K, and L denote output, capital input, and labor input, respectively.® 
The CES production function (see Arrow et al., 1961) may be written as 


$$Q_t = \beta_1[\beta_2 K_t^{-\beta_3} + (1 - \beta_2)L_t^{-\beta_3}]^{-\beta_4/\beta_3} + u_t. \tag{4.3.4}$$


See Mizon (1977) for several other nonlinear production functions. In the 
estimation of demand functions, a number of highly nonlinear functions have 
been proposed (some of these are also used for supply functions), for example, 
translog (Christensen, Jorgenson, and Lau, 1975), generalized Leontief
(Diewert, 1974), S-branch (Brown and Heien, 1972), and quadratic (Howe,
Pollak, and Wales, 1979).

As in the case of the maximum likelihood estimator, we can define the
nonlinear least squares estimator (abbreviated as NLLS) of $\beta_0$ in two
ways, depending on whether we consider the global minimum or a local minimum.
In the global case we define it as the value of $\beta$ that minimizes

$$S_T = \sum_{t=1}^T [y_t - f_t(\beta)]^2 \tag{4.3.5}$$

over some parameter space $B$. In the local case we define it as a root of the
normal equation

$$\frac{\partial S_T}{\partial\beta} = 0. \tag{4.3.6}$$

We shall consider only the latter case because (4.3.6) is needed to prove
asymptotic normality, as we have seen in Section 4.1.2. Given the NLLS
estimator $\hat\beta$ of $\beta_0$, we define the NLLS estimator of
$\sigma_0^2$, denoted as $\hat\sigma^2$, by

$$\hat\sigma^2 = T^{-1}S_T(\hat\beta). \tag{4.3.7}$$

Note that $\hat\beta$ and $\hat\sigma^2$ defined above are also the maximum
likelihood estimators if the $\{u_t\}$ are normally distributed.


4.3.2 Consistency’ 


We shall make additional assumptions in the nonlinear regression model so 
that the assumptions of Theorem 4.1.2 are satisfied. 


THEOREM 4.3.1. In the nonlinear regression model (4.3.1), make the
additional assumptions: There exists an open neighborhood $N$ of $\beta_0$
such that

(A) $\partial f_t/\partial\beta$ exists and is continuous on $N$.

(B) $f_t(\beta)$ is continuous in $\beta \in N$ uniformly in $t$; that is,
given $\epsilon > 0$ there exists $\delta > 0$ such that
$|f_t(\beta_1) - f_t(\beta_2)| < \epsilon$ whenever
$(\beta_1 - \beta_2)'(\beta_1 - \beta_2) < \delta$ for all
$\beta_1, \beta_2 \in N$ and for all $t$.⁸

(C) $T^{-1}\sum_{t=1}^T f_t(\beta_1)f_t(\beta_2)$ converges uniformly in
$\beta_1, \beta_2 \in N$.

(D) $\lim T^{-1}\sum_{t=1}^T [f_t(\beta_0) - f_t(\beta)]^2 \ne 0$ if
$\beta \ne \beta_0$.


Then a root of (4.3.6) is consistent in the sense of Theorem 4.1.2.
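Proof. Inserting (4.3.1) into (4.3.5), we can rewrite $T^{-1}$ times (4.3.5) as

$$\frac{1}{T}S_T = \frac{1}{T}\sum_{t=1}^T u_t^2 + \frac{2}{T}\sum_{t=1}^T [f_t(\beta_0) - f_t(\beta)]u_t + \frac{1}{T}\sum_{t=1}^T [f_t(\beta_0) - f_t(\beta)]^2 \equiv A_1 + A_2 + A_3. \tag{4.3.8}$$

The term $A_1$ converges to $\sigma_0^2$ in probability by Theorem 3.3.2
(Kolmogorov LLN 2). The term $A_3$ converges to a function that has a local
minimum at $\beta_0$ uniformly in $\beta$ because of assumptions C and D. We
shall show that $A_2$ converges to 0 in probability uniformly in
$\beta \in N$ by an argument similar to the proof of Theorem 4.2.1. First,
$\operatorname{plim}_{T\to\infty} T^{-1}\sum_{t=1}^T f_t(\beta_0)u_t = 0$ because
of assumption C and Theorem 3.2.1. Next, consider
$\sup_{\beta \in N} T^{-1}|\sum_{t=1}^T f_t(\beta)u_t|$. Partition $N$ into $n$
nonoverlapping regions $N_1, N_2, \ldots, N_n$. Because of assumption B, for
any $\epsilon > 0$ we can find a sufficiently large $n$ such that for each
$i = 1, 2, \ldots, n$

$$|f_t(\beta_1) - f_t(\beta_2)| < \frac{\epsilon}{2(\sigma_0^2 + 1)^{1/2}} \quad \text{for } \beta_1, \beta_2 \in N_i \text{ and for all } t. \tag{4.3.9}$$

Therefore, using the Cauchy-Schwarz inequality, we have

$$\sup_{\beta \in N_i} \left|\frac{1}{T}\sum_{t=1}^T f_t(\beta)u_t\right| \le \left|\frac{1}{T}\sum_{t=1}^T f_t(\beta_i)u_t\right| + \left(\frac{1}{T}\sum_{t=1}^T u_t^2\right)^{1/2} \frac{\epsilon}{2(\sigma_0^2 + 1)^{1/2}}, \tag{4.3.10}$$

where $\beta_i$ is an arbitrary fixed point in $N_i$. Therefore

$$P\left[\sup_{\beta \in N} \left|\frac{1}{T}\sum_{t=1}^T f_t(\beta)u_t\right| > \epsilon\right] \le \sum_{i=1}^n P\left[\sup_{\beta \in N_i} \left|\frac{1}{T}\sum_{t=1}^T f_t(\beta)u_t\right| > \epsilon\right] \le \sum_{i=1}^n P\left[\left|\frac{1}{T}\sum_{t=1}^T f_t(\beta_i)u_t\right| > \frac{\epsilon}{2}\right] + nP\left[\frac{1}{T}\sum_{t=1}^T u_t^2 > \sigma_0^2 + 1\right]. \tag{4.3.11}$$

Finally, we obtain the desired result by taking the limit of both sides of the
inequality (4.3.11) as $T$ goes to $\infty$. Thus we have shown that
assumption C of Theorem 4.1.2 is satisfied. Assumptions A and B of Theorem
4.1.2 are clearly satisfied.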


Assumption C of Theorem 4.3.1 is not easily verifiable in practice; therefore
it is desirable to find a sufficient set of conditions that imply assumption C
and are more easily verifiable. One such set is provided by Theorem 4.2.3. To
apply the theorem to the present problem, we should assume
$f_t(\beta) = f(x_t, \beta)$ and take $x_t$ and $f_t(\beta_1)f_t(\beta_2)$ as
the $y_t$ and $g(y_t, \theta)$ of Theorem 4.2.3, respectively. Alternatively,
one could assume that $\{x_t\}$ are i.i.d. random variables and use Theorem
4.2.1.

In the next example the conditions of Theorem 4.3.1 will be verified for a
simple nonlinear regression model.

EXAMPLE 4.3.1. Consider the nonlinear regression model (4.3.1) with
$f_t(\beta_0) = \log(\beta_0 + x_t)$, where $\beta_0$ and $x_t$ are scalars.
Assume (i) the parameter space $B$ is a bounded open interval $(c, d)$, (ii)
$x_t + \beta > \delta > 0$ for every $t$ and for every $\beta \in B$, and (iii)
$\{x_t\}$ are i.i.d. random variables such that
$E\{[\log(d + x_t)]^2\} < \infty$. Prove that a root of (4.3.6) is consistent.

First, note that $\log(\beta_0 + x_t)$ is well defined because of assumptions
(i) and (ii). Let us verify the conditions of Theorem 4.3.1. Condition A is
clearly satisfied because of assumptions (i) and (ii). Condition B is
satisfied because

$$|\log(\beta_1 + x_t) - \log(\beta_2 + x_t)| = |\beta^* + x_t|^{-1}|\beta_1 - \beta_2|$$

by the mean value theorem, where $\beta^*$ (depending on $x_t$) is between
$\beta_1$ and $\beta_2$, and because $|\beta^* + x_t|^{-1}$ is uniformly
bounded on account of assumptions (i) and (ii). Condition C follows from
assumption (iii) because of Theorem 4.2.1. To verify condition D, use the mean
value theorem to obtain

$$\frac{1}{T}\sum_{t=1}^T [\log(\beta + x_t) - \log(\beta_0 + x_t)]^2 = \frac{1}{T}\left[\sum_{t=1}^T (\beta_t^\dagger + x_t)^{-2}\right](\beta - \beta_0)^2,$$

where $\beta_t^\dagger$ (depending on $x_t$) is between $\beta$ and $\beta_0$.
But

$$\frac{1}{T}\sum_{t=1}^T (\beta_t^\dagger + x_t)^{-2} \ge \frac{1}{T}\sum_{t=1}^T (d + x_t)^{-2}$$

and

$$\operatorname{plim} \frac{1}{T}\sum_{t=1}^T (d + x_t)^{-2} = E(d + x_t)^{-2} > 0$$

because of assumptions (i), (ii), and (iii) and Theorem 3.3.2 (Kolmogorov
LLN 2). Therefore condition D holds.


When $f_t(\beta_0)$ has a very simple form, consistency can be proved more
simply and with fewer assumptions by using Theorem 4.1.1 or Theorem 4.1.2
directly rather than Theorem 4.3.1, as we shall show in the following example.


EXAMPLE 4.3.2. Consider the nonlinear regression model (4.3.1) with
$f_t(\beta_0) = (\beta_0 + x_t)^2$ and assume (i) $a \le \beta_0 \le b$ where
$a$ and $b$ are real numbers such that $a < b$, (ii)
$\lim_{T\to\infty} T^{-1}\sum_{t=1}^T x_t = q$, and (iii)
$\lim_{T\to\infty} T^{-1}\sum_{t=1}^T x_t^2 = p > q^2$. Prove that the value of
$\beta$ that minimizes $\sum_{t=1}^T [y_t - (\beta + x_t)^2]^2$ in the domain
$[a, b]$ is a consistent estimator of $\beta_0$.

We have

$$\operatorname{plim} \frac{1}{T}\sum_{t=1}^T [y_t - (\beta + x_t)^2]^2 = \sigma_0^2 + (\beta_0^2 - \beta^2)^2 + 4(\beta_0 - \beta)^2 p + 4(\beta_0^2 - \beta^2)(\beta_0 - \beta)q \equiv Q(\beta),$$

where the convergence is clearly uniform in $\beta \in [a, b]$. But, because

$$Q(\beta) \ge \sigma_0^2 + (\beta_0 - \beta)^2(\beta_0 + \beta + 2q)^2,$$

with strict inequality for $\beta \ne \beta_0$ since $p > q^2$, $Q(\beta)$ is
uniquely minimized at $\beta = \beta_0$. Therefore the estimator in question
is consistent by Theorem 4.1.1.
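
The limit function $Q(\beta)$ of Example 4.3.2 and its lower bound can be
checked numerically. The Python sketch below is only an illustration; the
values $\beta_0 = 1$, $\sigma_0^2 = 1$, $q = 1$, $p = 4/3$ (which satisfy
$p > q^2$), the grid on $[-3, 3]$, and the function names are all assumptions
made for the example.

```python
def q_limit(beta, beta0, sigma2, p, q):
    """Probability limit Q(beta) of T^{-1} S_T in Example 4.3.2."""
    return (sigma2
            + (beta0 ** 2 - beta ** 2) ** 2
            + 4 * (beta0 - beta) ** 2 * p
            + 4 * (beta0 ** 2 - beta ** 2) * (beta0 - beta) * q)


def q_bound(beta, beta0, sigma2, q):
    """Lower bound sigma0^2 + (beta0 - beta)^2 (beta0 + beta + 2q)^2."""
    return sigma2 + (beta0 - beta) ** 2 * (beta0 + beta + 2 * q) ** 2


beta0, sigma2, p, q = 1.0, 1.0, 4.0 / 3.0, 1.0  # assumed values with p > q**2
grid = [i / 100 for i in range(-300, 301)]       # grid over [a, b] = [-3, 3]

# Grid minimizer of Q, and the gap between Q and its lower bound.
argmin = min(grid, key=lambda b: q_limit(b, beta0, sigma2, p, q))
gaps = [q_limit(b, beta0, sigma2, p, q) - q_bound(b, beta0, sigma2, q)
        for b in grid]
```

On this grid the minimizer is the assumed true value $\beta_0$, and the gap
equals $4(\beta_0 - \beta)^2(p - q^2)$, which is nonnegative.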


4.3.3 Asymptotic Normality


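We shall now prove the asymptotic normality of the NLLS estimator of
$\beta_0$ by making assumptions on $f_t$ to satisfy the assumptions of
Theorem 4.1.3.

First, consider assumption C of Theorem 4.1.3. We have

$$\frac{\partial S_T}{\partial\beta} = -2\sum_{t=1}^T [y_t - f_t(\beta)]\frac{\partial f_t}{\partial\beta}. \tag{4.3.12}$$

Therefore we have

$$\frac{1}{\sqrt{T}} \left.\frac{\partial S_T}{\partial\beta}\right|_{\beta_0} = -\frac{2}{\sqrt{T}}\sum_{t=1}^T u_t \left.\frac{\partial f_t}{\partial\beta}\right|_{\beta_0}. \tag{4.3.13}$$

The results of Section 3.5 show that if we assume in the nonlinear regression
model that

$$\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^T \left.\frac{\partial f_t}{\partial\beta}\right|_{\beta_0} \left.\frac{\partial f_t}{\partial\beta'}\right|_{\beta_0} \ (\equiv C) \tag{4.3.14}$$

is a finite nonsingular matrix,⁹ then the limit distribution of (4.3.13) is
$N(0, 4\sigma_0^2 C)$.

Second, consider the assumption of Theorem 4.1.5 that implies assumption
B of Theorem 4.1.3, except for nonsingularity. From (4.3.12) we have

$$\frac{1}{T}\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'} = \frac{2}{T}\sum_{t=1}^T \frac{\partial f_t}{\partial\beta}\frac{\partial f_t}{\partial\beta'} - \frac{2}{T}\sum_{t=1}^T u_t \frac{\partial^2 f_t}{\partial\beta\,\partial\beta'} - \frac{2}{T}\sum_{t=1}^T [f_t(\beta_0) - f_t(\beta)]\frac{\partial^2 f_t}{\partial\beta\,\partial\beta'} \equiv A_1 + A_2 + A_3. \tag{4.3.15}$$

We must make the assumption of Theorem 4.1.5 hold for each of the three
terms in (4.3.15). First, for $A_1$ to satisfy the assumption of Theorem
4.1.5, we must assume

$$\frac{1}{T}\sum_{t=1}^T \frac{\partial f_t}{\partial\beta}\frac{\partial f_t}{\partial\beta'} \text{ converges to a finite matrix uniformly for all } \beta \text{ in an open neighborhood of } \beta_0. \tag{4.3.16}$$

For $A_2$ to converge to 0 in probability uniformly in a neighborhood of
$\beta_0$, we require (as we can infer from the proof of the convergence of
$A_2$ in Theorem 4.3.1) that

$$\frac{\partial^2 f_t}{\partial\beta_i\,\partial\beta_j} \text{ is continuous in } \beta \text{ in an open neighborhood of } \beta_0 \text{ uniformly in } t \tag{4.3.17}$$

and

$$\lim_{T\to\infty} \frac{1}{T^2}\sum_{t=1}^T \left[\frac{\partial^2 f_t}{\partial\beta_i\,\partial\beta_j}\right]^2 = 0 \text{ for all } \beta \text{ in an open neighborhood of } \beta_0. \tag{4.3.18}$$

Finally, the uniform convergence of $A_3$ requires

$$\frac{1}{T}\sum_{t=1}^T f_t(\beta_1) \left.\frac{\partial^2 f_t}{\partial\beta\,\partial\beta'}\right|_{\beta_2} \text{ converges to a finite matrix uniformly for all } \beta_1 \text{ and } \beta_2 \text{ in an open neighborhood of } \beta_0. \tag{4.3.19}$$

Thus under these assumptions we have

$$\operatorname{plim} \frac{1}{T} \left.\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'}\right|_{\beta_T^*} = 2\lim \frac{1}{T}\sum_{t=1}^T \left.\frac{\partial f_t}{\partial\beta}\right|_{\beta_0} \left.\frac{\partial f_t}{\partial\beta'}\right|_{\beta_0} \tag{4.3.20}$$

whenever plim $\beta_T^* = \beta_0$.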
These results can be summarized as a theorem. 


THEOREM 4.3.2. In addition to the assumptions of the nonlinear regression
model and the assumptions of Theorem 4.3.1, assume conditions (4.3.14)
and (4.3.16)-(4.3.19). Then, if $\hat\beta_T$ is a consistent root of the
normal equation (4.3.6), we have

$$\sqrt{T}(\hat\beta_T - \beta_0) \to N(0, \sigma_0^2 C^{-1}). \tag{4.3.21}$$


As in condition C of Theorem 4.3.1, a simple way to verify conditions
(4.3.16) and (4.3.19) is to assume that $x_t$ are i.i.d. and use Theorem 4.2.1.
Condition (4.3.18) follows from the uniform convergence of
$T^{-1}\sum_{t=1}^T (\partial^2 f_t/\partial\beta_i\partial\beta_j)^2$, which in
turn can be verified by using Theorem 4.2.1. It would be instructive for the
reader to verify these conditions for Example 4.3.1 and to prove the
asymptotic normality of the estimator in question.

When $f_t(\beta_0)$ has a simple form, the convergence of
$T^{-1}\partial^2 S_T/\partial\beta\partial\beta'$ can be easily established
and, therefore, a cumbersome verification of conditions (4.3.16)-(4.3.19) can
be avoided. To illustrate this point, we shall consider the model of Example
4.3.2 again.


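EXAMPLE 4.3.3. Consider the model of Example 4.3.2 and, in addition to
conditions (i), (ii), and (iii) given there, also assume (iv)
$a < \beta_0 < b$ and (v) $\lim_{T\to\infty} T^{-1}\sum_{t=1}^T x_t^4 = r$.
Obtain the asymptotic distribution of the value of $\beta$ that minimizes
$\sum_{t=1}^T [y_t - (\beta + x_t)^2]^2$ in the domain $(-\infty, \infty)$.

Note that here the minimization is carried out over the whole real line
whereas in Example 4.3.2 it was done in the interval $[a, b]$. We denote the
unconstrained minimum by $\hat\beta$ and the constrained minimum by
$\tilde\beta$. In Example 4.3.2 we proved plim $\tilde\beta = \beta_0$. But,
because $\beta_0$ is an interior point of $[a, b]$ by assumption (iv),
$\lim P[\hat\beta = \tilde\beta] = 1$; therefore we also have
plim $\hat\beta = \beta_0$.

Using (4.1.11), we have

$$\sqrt{T}(\hat\beta - \beta_0) = -\left[\frac{1}{T} \left.\frac{\partial^2 S_T}{\partial\beta^2}\right|_{\beta^*}\right]^{-1} \frac{1}{\sqrt{T}} \left.\frac{\partial S_T}{\partial\beta}\right|_{\beta_0}, \tag{4.3.22}$$

where $\beta^*$ lies between $\hat\beta$ and $\beta_0$. We have

$$\frac{1}{\sqrt{T}} \left.\frac{\partial S_T}{\partial\beta}\right|_{\beta_0} = -\frac{4}{\sqrt{T}}\sum_{t=1}^T u_t(\beta_0 + x_t). \tag{4.3.23}$$

Therefore, using assumptions (ii) and (iii) and Theorems 3.5.3 and 3.5.5, we
obtain

$$\frac{1}{\sqrt{T}} \left.\frac{\partial S_T}{\partial\beta}\right|_{\beta_0} \to N[0, 16\sigma_0^2(\beta_0^2 + p + 2\beta_0 q)]. \tag{4.3.24}$$

We also have

$$\frac{1}{T} \left.\frac{\partial^2 S_T}{\partial\beta^2}\right|_{\beta^*} = \frac{8}{T}\sum_{t=1}^T (\beta^* + x_t)^2 - \frac{4}{T}\sum_{t=1}^T [y_t - (\beta^* + x_t)^2]. \tag{4.3.25}$$

Because the right-hand side of (4.3.25) is a continuous function of a finite
fixed number of sequences of random variables, we can use Theorem 3.2.6 to
evaluate its probability limit. Thus, because plim $\beta^* = \beta_0$, we
obtain

$$\operatorname{plim} \frac{1}{T} \left.\frac{\partial^2 S_T}{\partial\beta^2}\right|_{\beta^*} = 8(\beta_0^2 + p + 2\beta_0 q). \tag{4.3.26}$$

Finally, (4.3.22), (4.3.24), and (4.3.26) imply by Theorem 3.2.7 (iii) that

$$\sqrt{T}(\hat\beta - \beta_0) \to N\left[0, \frac{\sigma_0^2}{4(\beta_0^2 + p + 2\beta_0 q)}\right]. \tag{4.3.27}$$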

The distribution of the NLLS estimator may be approximated more accu- 
rately by using the Edgeworth expansion (see Pfanzagl, 1973). 


4.3.4 Bootstrap and Jackknife Methods


In this subsection, we shall consider briefly two methods of approximating the
distribution of the nonlinear least squares estimator; these methods are called
the bootstrap and the jackknife methods (see Efron, 1982, for the details). As
the reader will see from the following discussion, they can be applied to many
situations other than the nonlinear regression model.

The bootstrap method is carried out in the following steps:

1. Calculate $\hat u_t = y_t - f_t(\hat\beta)$, where $\hat\beta$ is the NLLS
estimator.

2. Calculate the empirical distribution function $F$ of $\{\hat u_t\}$.

3. Generate $NT$ random variables $\{u_{it}^*\}$, $i = 1, 2, \ldots, N$ and
$t = 1, 2, \ldots, T$, according to $F$, and calculate
$y_{it}^* = f_t(\hat\beta) + u_{it}^*$.

4. Calculate the NLLS estimator $\beta_i^*$ that minimizes
$\sum_{t=1}^T [y_{it}^* - f_t(\beta)]^2$ for $i = 1, 2, \ldots, N$.

5. Approximate the distribution of $\hat\beta$ by the empirical distribution
function of $\{\beta_i^*\}$.
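
The five bootstrap steps above can be sketched in Python. This is a minimal
illustration under stated assumptions, not a definitive implementation: it
borrows the scalar model $f_t(\beta) = \log(\beta + x_t)$ of Example 4.3.1,
simulates its own data, and computes the NLLS estimate by a golden-section
search over an assumed interval $[0.5, 5]$. The helper names `golden_min`,
`nlls`, and `bootstrap_nlls` are hypothetical.

```python
import math
import random
import statistics


def f(beta, x):
    """Regression function of Example 4.3.1 (assumed here for illustration)."""
    return math.log(beta + x)


def golden_min(func, lo, hi, tol=1e-6):
    """Golden-section search for the minimizer of a unimodal function."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if func(c) < func(d):
            b = d
        else:
            a = c
    return (a + b) / 2


def nlls(y, x, lo=0.5, hi=5.0):
    """NLLS estimate: minimize S_T(beta) over the assumed interval [lo, hi]."""
    return golden_min(
        lambda b: sum((yt - f(b, xt)) ** 2 for yt, xt in zip(y, x)), lo, hi)


def bootstrap_nlls(y, x, n_boot, rng):
    beta_hat = nlls(y, x)
    fitted = [f(beta_hat, xt) for xt in x]
    resid = [yt - ft for yt, ft in zip(y, fitted)]        # step 1
    draws = []
    for _ in range(n_boot):
        # Steps 2-3: sampling residuals with replacement is the same as
        # drawing from their empirical distribution function F.
        y_star = [ft + rng.choice(resid) for ft in fitted]
        draws.append(nlls(y_star, x))                     # step 4
    return beta_hat, draws                                # step 5 uses draws


rng = random.Random(1)
x = [rng.uniform(0.0, 3.0) for _ in range(100)]
y = [f(2.0, xt) + rng.gauss(0.0, 0.2) for xt in x]        # simulated data
beta_hat, draws = bootstrap_nlls(y, x, 100, rng)
boot_se = statistics.stdev(draws)   # bootstrap standard error of beta_hat
```

The list `draws` plays the role of $\{\beta_i^*\}$; its empirical distribution
(here summarized by its standard deviation) approximates the distribution of
$\hat\beta$.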
The jackknife method works as follows: Partition $y$ as
$y' = (y_1', y_2', \ldots, y_N')$, where each $y_i$ is an $m$-vector such that
$mN = T$. Let $\hat\beta$ be the NLLS estimator using all data and let
$\hat\beta_{-i}$ be the NLLS estimator obtained by omitting $y_i$. Then
"pseudovalues" $\beta_i^* = N\hat\beta - (N - 1)\hat\beta_{-i}$,
$i = 1, 2, \ldots, N$, can be treated like $N$ observations (though not
independent) on $\beta$. Thus, for example, $V\hat\beta$ may be estimated by
$(N - 1)^{-1}\sum_{i=1}^N (\beta_i^* - \bar\beta^*)(\beta_i^* - \bar\beta^*)'$,
where $\bar\beta^* = N^{-1}\sum_{i=1}^N \beta_i^*$.

It is interesting to note that $\bar\beta^*$ may be regarded as an estimator of
$\beta$ in its own right and is called the jackknife estimator. Akahira (1983)
showed that in the i.i.d. sample case the jackknife estimator is
asymptotically equivalent to the bias-corrected maximum likelihood estimator
(see Section 4.2.4) to the order $T^{-1}$.
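
The pseudovalue construction can be sketched in Python for a generic scalar
estimator. This is an illustration under assumptions: the function name
`jackknife` is hypothetical, and instead of NLLS the example applies the
method to the maximum likelihood estimator of a normal variance (divisor $T$),
for which the bias correction can be checked by hand. The sketch returns the
sample variance of the pseudovalues with divisor $N - 1$, as in the text; note
that the usual variance estimate for the jackknife estimator itself divides
this quantity by a further factor of $N$.

```python
import statistics


def jackknife(data, estimator, n_groups):
    """Return the jackknife estimator, the sample variance of the
    pseudovalues (divisor N - 1), and the pseudovalues themselves."""
    T = len(data)
    if T % n_groups != 0:
        raise ValueError("n_groups must divide the sample size (mN = T)")
    m = T // n_groups
    full = estimator(data)
    pseudo = []
    for i in range(n_groups):
        kept = data[:i * m] + data[(i + 1) * m:]   # omit the i-th block y_i
        pseudo.append(n_groups * full - (n_groups - 1) * estimator(kept))
    return statistics.mean(pseudo), statistics.variance(pseudo), pseudo


def ml_variance(sample):
    """ML estimator of a normal variance: average squared deviation."""
    mu = statistics.mean(sample)
    return sum((v - mu) ** 2 for v in sample) / len(sample)


data = [2.1, 3.4, 1.9, 2.8, 3.0, 2.5]
jack, pseudo_var, pseudo = jackknife(data, ml_variance, n_groups=6)
```

With $m = 1$ the jackknife estimator of the variance reproduces the familiar
unbiased estimator with divisor $T - 1$, which illustrates the bias-correcting
property mentioned above.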


4.3.5 Tests of Hypotheses 


In the process of proving Theorem 4.3.2, we have in effect shown that
asymptotically

$$\hat\beta_T - \beta_0 \cong (G'G)^{-1}G'u, \tag{4.3.28}$$

where we have put $G = (\partial f/\partial\beta')_{\beta_0}$. Note that
(4.3.28) exactly holds in the linear case because then $G = X$. The practical
consequence of the approximation (4.3.28) is that all the results for the
linear regression model (Model 1) are asymptotically valid for the nonlinear
regression model if we treat $G$ as the regressor matrix. (In practice we must
use $\hat G = (\partial f/\partial\beta')_{\hat\beta}$, where $\hat\beta$ is the
NLLS estimator.)

Let us generalize the $t$ and $F$ statistics of the linear model by this
principle. If the linear hypothesis $Q'\beta = c$ consists of a single
equation, we can use the following generalization of (1.5.4):

$$\frac{Q'\hat\beta - c}{\hat\sigma[Q'(\hat G'\hat G)^{-1}Q]^{1/2}} \overset{A}{\sim} t_{T-K}, \tag{4.3.29}$$

where $\overset{A}{\sim}$ means "asymptotically distributed as" and
$\hat\sigma^2 = (T - K)^{-1}S_T(\hat\beta)$. Gallant (1975a) examined the
accuracy of the approximation (4.3.29) by a Monte Carlo experiment using the
model

$$f_t(\beta) = \beta_1 x_{1t} + \beta_2 x_{2t} + \beta_4 \exp(\beta_3 x_{3t}). \tag{4.3.30}$$

For each of the four parameters, the empirical distribution of the left-hand
side of (4.3.29) matched the distribution of $t_{T-K}$ reasonably well,
although, as we would suspect, the performance was the poorest for
$\hat\beta_3$.

If $Q'\beta = c$ consists of $q\ (> 1)$ equations, we obtain two different
approximate $F$ statistics depending on whether we generalize the formula
(1.5.9) or the formula (1.5.12). Generalizing (1.5.9), we obtain

$$\frac{T - K}{q}\,\frac{S_T(\tilde\beta) - S_T(\hat\beta)}{S_T(\hat\beta)} \overset{A}{\sim} F(q, T - K), \tag{4.3.31}$$

where $\tilde\beta$ is the constrained NLLS estimator obtained by minimizing
$S_T(\beta)$ subject to $Q'\beta = c$. Generalizing the formula (1.5.12), we
obtain

$$\frac{T - K}{q}\,\frac{(Q'\hat\beta - c)'[Q'(\hat G'\hat G)^{-1}Q]^{-1}(Q'\hat\beta - c)}{S_T(\hat\beta)} \overset{A}{\sim} F(q, T - K). \tag{4.3.32}$$

These two formulae were shown to be identical in the linear model, but they
are different in the nonlinear model. A Monte Carlo study by Gallant (1975b),
using the model (4.3.30), indicated that the test based on (4.3.31) has higher
power than the test based on (4.3.32).


4.4 Methods of Iteration 


Be it for the maximum likelihood or the nonlinear least squares estimator, we
cannot generally solve the equation of the form (4.1.9) explicitly for
$\theta$. Instead, we must solve it iteratively: Start from an initial
estimate of $\theta$ (say $\hat\theta_1$) and obtain a sequence of estimates
$\{\hat\theta_n\}$ by iteration, which we hope will converge to the global
maximum (or minimum) of $Q_T$ or at least a root of Eq. (4.1.9). Numerous
iterative methods have been proposed and used. In this section we shall
discuss several well-known methods that are especially suitable for obtaining
the maximum likelihood and the nonlinear least squares estimator. Many of the
results of this section can be found in Goldfeld and Quandt (1972), Draper and
Smith (1981), or, more extensively, in Quandt (1983).


4.4.1 Newton-Raphson Method 


The Newton-Raphson method is based on the following quadratic approximation
of the maximand (or minimand, as the case may be):

$$Q(\theta) \cong Q(\hat\theta_1) + g_1'(\theta - \hat\theta_1) + \tfrac{1}{2}(\theta - \hat\theta_1)'H_1(\theta - \hat\theta_1), \tag{4.4.1}$$

where $\hat\theta_1$ is an initial estimate and

$$g_1 = \left.\frac{\partial Q}{\partial\theta}\right|_{\hat\theta_1} \quad\text{and}\quad H_1 = \left.\frac{\partial^2 Q}{\partial\theta\,\partial\theta'}\right|_{\hat\theta_1}.$$

The second-round estimator $\hat\theta_2$ of the Newton-Raphson iteration is
obtained by maximizing the right-hand side of the approximation (4.4.1).
Therefore

$$\hat\theta_2 = \hat\theta_1 - H_1^{-1}g_1. \tag{4.4.2}$$

The iteration (4.4.2) is to be repeated until the sequence
$\{\hat\theta_n\}$ thus obtained converges.

Inserting iteration (4.4.2) back into approximation (4.4.1) yields

$$Q(\hat\theta_2) \cong Q(\hat\theta_1) - \tfrac{1}{2}(\hat\theta_2 - \hat\theta_1)'H_1(\hat\theta_2 - \hat\theta_1). \tag{4.4.3}$$

Equation (4.4.3) shows a weakness of this method: Even if (4.4.3) holds
exactly, $Q(\hat\theta_2) > Q(\hat\theta_1)$ is not guaranteed unless $H_1$ is
a negative definite matrix. Another weakness is that even if $H_1$ is negative
definite, $\hat\theta_2 - \hat\theta_1$ may be too large or too small: If it is
too large, it overshoots the target; if it is too small, the speed of
convergence is slow.
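
As a minimal sketch of iteration (4.4.2), consider a scalar case in Python.
The model is an assumption chosen for illustration, not one from the text:
$y_t$ are i.i.d. exponential with rate $\theta$, so that
$\log L = T\log\theta - \theta\sum y_t$, the Hessian is negative for every
$\theta > 0$, and the maximum likelihood estimator $T/\sum y_t$ is known in
closed form. The function name `newton_raphson` and the data are hypothetical.

```python
def newton_raphson(grad, hess, theta, tol=1e-12, max_iter=100):
    """Repeat theta_{n+1} = theta_n - H_n^{-1} g_n (scalar case) until the
    step is smaller than tol or max_iter iterations have been used."""
    for _ in range(max_iter):
        step = grad(theta) / hess(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta


y = [0.8, 1.7, 0.3, 2.4, 1.1, 0.9]       # illustrative data
T, total = len(y), sum(y)


def grad(theta):
    """d log L / d theta for the exponential model."""
    return T / theta - total


def hess(theta):
    """d^2 log L / d theta^2, negative for every theta > 0."""
    return -T / theta ** 2


theta_hat = newton_raphson(grad, hess, theta=0.5)
```

Started from 0.5, the iterates should approach the closed-form estimator
$T/\sum y_t$; a starting value far above it can send the iterate out of the
region $\theta > 0$, which illustrates the overshooting weakness.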

The first weakness may be alleviated if we modify (4.4.2) as

$$\hat\theta_2 = \hat\theta_1 - (H_1 - \alpha_1 I)^{-1}g_1, \tag{4.4.4}$$

where $I$ is the identity matrix and $\alpha_1$ is a scalar to be
appropriately chosen by the researcher subject to the condition that
$H_1 - \alpha_1 I$ is negative definite. This modification was proposed by
Goldfeld, Quandt, and Trotter (1966) and is called quadratic hill-climbing.
[Goldfeld, Quandt, and Trotter (1966) and Goldfeld and Quandt (1972, Chapter 1)
have discussed how to choose $\alpha_1$ and the convergence properties of the
method.]

The second weakness may be remedied by the modification

$$\hat\theta_2 = \hat\theta_1 - \lambda_1 H_1^{-1}g_1, \tag{4.4.5}$$

where the scalar $\lambda_1$ is to be appropriately determined. Fletcher and
Powell (1963) have presented a method to determine $\lambda_1$ by cubic
interpolation of $Q(\theta)$ along the current search direction. [This method
is called the DFP iteration because Fletcher and Powell refined the method
originally proposed by Davidon (1959).] Also, Berndt et al. (1974) have
presented another method for choosing $\lambda_1$.
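
The two modifications (4.4.4) and (4.4.5) can be combined in a short scalar
sketch. Everything specific here is an assumption made for illustration: the
maximand $Q(\theta) = -\log(1 + \theta^2)$, whose second derivative is positive
for $|\theta| > 1$, the rule $\alpha_1 = H_1 + 1$ whenever $H_1 \ge 0$, and
simple step halving for $\lambda_1$ in place of the cubic interpolation of
Fletcher and Powell.

```python
import math


def Q(t):
    """Illustrative maximand with its maximum at t = 0."""
    return -math.log(1 + t * t)


def grad(t):
    return -2 * t / (1 + t * t)


def hess(t):
    return -2 * (1 - t * t) / (1 + t * t) ** 2


def plain_step(t):
    """Unmodified Newton-Raphson step (4.4.2)."""
    return t - grad(t) / hess(t)


def modified_step(t, ridge=1.0):
    h = hess(t)
    # (4.4.4): choose alpha so that h - alpha is negative (assumed rule).
    alpha = 0.0 if h < 0 else h + ridge
    direction = -grad(t) / (h - alpha)
    # (4.4.5): shrink lambda by halving until Q increases (assumed rule).
    lam = 1.0
    while Q(t + lam * direction) <= Q(t) and lam > 1e-12:
        lam /= 2
    return t + lam * direction


def climb(t, tol=1e-10, max_iter=100):
    """Iterate the modified step until the gradient is negligible."""
    for _ in range(max_iter):
        if abs(grad(t)) < tol:
            break
        t = modified_step(t)
    return t


theta_hat = climb(2.0)
```

From the starting value 2, where $H_1 > 0$, the unmodified step moves away
from the maximum, whereas the modified iteration increases $Q$ at every step
and should settle at the maximizer 0.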

The Newton-Raphson method can be used to obtain either the maximum
likelihood or the nonlinear least squares estimator by choosing the
appropriate $Q$. In the case of the MLE,
$E(\partial^2 \log L/\partial\theta\partial\theta')$ may be substituted for
$\partial^2 \log L/\partial\theta\partial\theta'$ in defining $H$. If this is
done, the iteration is called the method of scoring (see Rao, 1973, p. 366, or
Zacks, 1971, p. 232). In view of Eq. (4.2.22),
$-E(\partial \log L/\partial\theta)(\partial \log L/\partial\theta')$ may be
used instead; then we need not calculate the second derivatives of $\log L$.
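
A minimal sketch of the method of scoring follows. The model is an assumption
chosen for illustration: the location parameter of a Cauchy density with unit
scale, for which the expected second derivative of $\log L$ is $-T/2$, so that
the iteration becomes $\hat\theta_2 = \hat\theta_1 + (2/T)g_1$. The data and
function names are hypothetical.

```python
def score(theta, y):
    """d log L / d theta for a Cauchy location model with unit scale."""
    return sum(2 * (v - theta) / (1 + (v - theta) ** 2) for v in y)


def scoring(y, theta, tol=1e-12, max_iter=200):
    """Method of scoring: Newton-Raphson with H replaced by its expectation.
    For this model E(d^2 log L / d theta^2) = -T/2, so the information is T/2."""
    info = len(y) / 2.0
    for _ in range(max_iter):
        step = score(theta, y) / info
        theta += step
        if abs(step) < tol:
            break
    return theta


y = [-1.0, 0.0, 1.0]                 # illustrative symmetric sample
theta_hat = scoring(y, theta=0.3)
```

For this symmetric sample the likelihood equation has a root at 0, and the
iteration started at 0.3 should converge to it without any second derivative
of $\log L$ being evaluated.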


4.4.2 The Asymptotic Properties of the Second-Round Estimator in the 
Newton-Raphson Method 


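Ordinarily, iteration (4.4.2) is to be repeated until convergence takes place.
However, if $\hat\theta_1$ is a consistent estimator of $\theta_0$ such that
$\sqrt{T}(\hat\theta_1 - \theta_0)$ has a proper limit distribution, the
second-round estimator $\hat\theta_2$ has the same asymptotic distribution as
a consistent root of Eq. (4.1.9). In this case further iteration does not
bring any improvement, at least asymptotically. To show this, consider

$$g_1 = \left.\frac{\partial Q}{\partial\theta}\right|_{\theta_0} + \left.\frac{\partial^2 Q}{\partial\theta\,\partial\theta'}\right|_{\theta^*}(\hat\theta_1 - \theta_0), \tag{4.4.6}$$

where $\theta^*$ lies between $\hat\theta_1$ and $\theta_0$. Inserting (4.4.6)
into (4.4.2) yields

$$\sqrt{T}(\hat\theta_2 - \theta_0) = \left\{I - \left[\left.\frac{\partial^2 Q}{\partial\theta\,\partial\theta'}\right|_{\hat\theta_1}\right]^{-1} \left.\frac{\partial^2 Q}{\partial\theta\,\partial\theta'}\right|_{\theta^*}\right\}\sqrt{T}(\hat\theta_1 - \theta_0) - \left[\frac{1}{T}\left.\frac{\partial^2 Q}{\partial\theta\,\partial\theta'}\right|_{\hat\theta_1}\right]^{-1} \frac{1}{\sqrt{T}}\left.\frac{\partial Q}{\partial\theta}\right|_{\theta_0}. \tag{4.4.7}$$

But, because under the condition of Theorem 4.1.3

$$\operatorname{plim} \frac{1}{T}\left.\frac{\partial^2 Q}{\partial\theta\,\partial\theta'}\right|_{\hat\theta_1} = \operatorname{plim} \frac{1}{T}\left.\frac{\partial^2 Q}{\partial\theta\,\partial\theta'}\right|_{\theta^*} = \operatorname{plim} \frac{1}{T}\left.\frac{\partial^2 Q}{\partial\theta\,\partial\theta'}\right|_{\theta_0}, \tag{4.4.8}$$

we have

$$\sqrt{T}(\hat\theta_2 - \theta_0) \overset{LD}{=} -\left[\operatorname{plim} \frac{1}{T}\left.\frac{\partial^2 Q}{\partial\theta\,\partial\theta'}\right|_{\theta_0}\right]^{-1} \frac{1}{\sqrt{T}}\left.\frac{\partial Q}{\partial\theta}\right|_{\theta_0}, \tag{4.4.9}$$

which proves the desired result.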


4.4.3 Gauss-Newton Method


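The Gauss-Newton method was specifically designed to calculate the nonlinear
least squares estimator. Expanding $f_t(\beta)$ of Eq. (4.3.5) in a Taylor
series around the initial estimate $\hat\beta_1$, we obtain

$$f_t(\beta) \cong f_t(\hat\beta_1) + \left.\frac{\partial f_t}{\partial\beta'}\right|_{\hat\beta_1}(\beta - \hat\beta_1). \tag{4.4.10}$$

Substituting the right-hand side of (4.4.10) for $f_t(\beta)$ in (4.3.5) yields

$$S \cong \sum_{t=1}^T \left[y_t - f_t(\hat\beta_1) - \left.\frac{\partial f_t}{\partial\beta'}\right|_{\hat\beta_1}(\beta - \hat\beta_1)\right]^2. \tag{4.4.11}$$

The second-round estimator $\hat\beta_2$ of the Gauss-Newton iteration is
obtained by minimizing the right-hand side of approximation (4.4.11) with
respect to $\beta$ as

$$\hat\beta_2 = \hat\beta_1 - \frac{1}{2}\left[\sum_{t=1}^T \left.\frac{\partial f_t}{\partial\beta}\right|_{\hat\beta_1} \left.\frac{\partial f_t}{\partial\beta'}\right|_{\hat\beta_1}\right]^{-1} \left.\frac{\partial S}{\partial\beta}\right|_{\hat\beta_1}, \tag{4.4.12}$$

where

$$\left.\frac{\partial S}{\partial\beta}\right|_{\hat\beta_1} = -2\sum_{t=1}^T [y_t - f_t(\hat\beta_1)] \left.\frac{\partial f_t}{\partial\beta}\right|_{\hat\beta_1}. \tag{4.4.13}$$

The iteration (4.4.12) is to be repeated until convergence is obtained. This
method involves only the first derivatives of $f_t$, whereas the
Newton-Raphson iteration applied to nonlinear least squares estimation
involves the second derivatives of $f_t$ as well.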
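
Iteration (4.4.12) can be sketched in Python for a scalar parameter. The
sketch rests on assumptions made for illustration: the regression function
$f_t(\beta) = \log(\beta + x_t)$ of Example 4.3.1, noise-free data generated at
$\beta = 2$ so that the answer is known, and the hypothetical function names
`gn_step`, `gauss_newton`, and `auxiliary_regression`. The last function
computes the same update as the least squares coefficient in a regression of
$y_t - f_t(\hat\beta_1) + (\partial f_t/\partial\beta)\hat\beta_1$ on
$\partial f_t/\partial\beta$.

```python
import math


def f(beta, x):
    return math.log(beta + x)


def df(beta, x):
    """First derivative of f with respect to beta."""
    return 1.0 / (beta + x)


def gn_step(y, x, beta):
    """One Gauss-Newton update (4.4.12); with (4.4.13) it reduces to
    beta + (sum g_t^2)^{-1} sum g_t [y_t - f_t(beta)] in the scalar case."""
    g = [df(beta, xt) for xt in x]
    r = [yt - f(beta, xt) for yt, xt in zip(y, x)]
    return beta + sum(gt * rt for gt, rt in zip(g, r)) / sum(gt * gt for gt in g)


def gauss_newton(y, x, beta, tol=1e-12, max_iter=50):
    for _ in range(max_iter):
        new_beta = gn_step(y, x, beta)
        converged = abs(new_beta - beta) < tol
        beta = new_beta
        if converged:
            break
    return beta


def auxiliary_regression(y, x, beta):
    """The same update as a least squares coefficient: regress
    z_t = y_t - f_t(beta) + g_t * beta on g_t."""
    g = [df(beta, xt) for xt in x]
    z = [yt - f(beta, xt) + gt * beta for yt, xt, gt in zip(y, x, g)]
    return sum(gt * zt for gt, zt in zip(g, z)) / sum(gt * gt for gt in g)


x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [f(2.0, xt) for xt in x]          # noise-free data, true beta = 2
beta_hat = gauss_newton(y, x, beta=1.0)
```

Only first derivatives of $f_t$ are used. With noise-free data the iteration
started at 1 should recover the generating value 2, and the regression form
should give the same second-round estimate as the direct update.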

The Gauss-Newton iteration may be alternatively motivated as follows:
Evaluating the approximation (4.4.10) at $\beta_0$ and inserting it into Eq.
(4.3.1), we obtain

$$y_t - f_t(\hat\beta_1) + \left.\frac{\partial f_t}{\partial\beta'}\right|_{\hat\beta_1}\hat\beta_1 \cong \left.\frac{\partial f_t}{\partial\beta'}\right|_{\hat\beta_1}\beta_0 + u_t. \tag{4.4.14}$$

Then the second-round estimator $\hat\beta_2$ can be interpreted as the least
squares estimate of $\beta_0$ applied to the linear regression equation
(4.4.14), treating the whole left-hand side as the dependent variable and
$(\partial f_t/\partial\beta')_{\hat\beta_1}$ as the vector of independent
variables. Equation (4.4.14) reminds us of the point raised at the beginning
of Section 4.3.5, namely, the nonlinear regression model asymptotically
behaves like a linear regression model if we treat
$\partial f/\partial\beta'$ evaluated at a good estimate of $\beta$ as the
regressor matrix.

The Gauss-Newton iteration suffers from weaknesses similar to those of the
Newton-Raphson iteration, namely, the possibility of an exact or near
singularity of the matrix to be inverted in (4.4.12) and the possibility of
too much or too little change from $\hat\beta_n$ to $\hat\beta_{n+1}$.

To deal with the first weakness, Marquardt (1963) proposed a modification

$$\hat\beta_2 = \hat\beta_1 - \frac{1}{2}\left[\sum_{t=1}^T \left.\frac{\partial f_t}{\partial\beta}\right|_{\hat\beta_1} \left.\frac{\partial f_t}{\partial\beta'}\right|_{\hat\beta_1} + \alpha_1 I\right]^{-1} \left.\frac{\partial S}{\partial\beta}\right|_{\hat\beta_1}, \tag{4.4.15}$$

where $\alpha_1$ is a positive scalar to be appropriately chosen.

To deal with the second weakness, Hartley (1961) proposed the following
modification: First, calculate

$$\Delta_1 = -\frac{1}{2}\left[\sum_{t=1}^T \left.\frac{\partial f_t}{\partial\beta}\right|_{\hat\beta_1} \left.\frac{\partial f_t}{\partial\beta'}\right|_{\hat\beta_1}\right]^{-1} \left.\frac{\partial S}{\partial\beta}\right|_{\hat\beta_1} \tag{4.4.16}$$

and, second, choose $\lambda_1$ to minimize

$$S(\hat\beta_1 + \lambda_1\Delta_1), \quad 0 \le \lambda_1 \le 1. \tag{4.4.17}$$

Hartley proved that under general conditions his iteration converges to a root
of Eq. (4.3.6). (Gallant, 1975a, has made useful comments on Marquardt's
and Hartley's algorithms.)
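
The modifications (4.4.15) and (4.4.16)-(4.4.17) can be sketched for a scalar
parameter as follows. As before, the regression function
$f_t(\beta) = \log(\beta + x_t)$, the noise-free data, the fixed value
$\alpha_1 = 0.1$, the grid search over $\lambda_1$, and all function names are
assumptions made for illustration; neither Marquardt's nor Hartley's own rule
for choosing the scalar is reproduced here.

```python
import math


def f(beta, x):
    return math.log(beta + x)


def df(beta, x):
    return 1.0 / (beta + x)


def S(beta, y, x):
    """Sum of squared residuals (4.3.5)."""
    return sum((yt - f(beta, xt)) ** 2 for yt, xt in zip(y, x))


def gn_direction(beta, y, x, alpha=0.0):
    """Scalar form of -(1/2) [sum g_t^2 + alpha]^{-1} dS/dbeta."""
    num = sum(df(beta, xt) * (yt - f(beta, xt)) for yt, xt in zip(y, x))
    den = sum(df(beta, xt) ** 2 for xt in x) + alpha
    return num / den


def marquardt_step(beta, y, x, alpha=0.1):
    """Update (4.4.15) with a fixed positive alpha (assumed value)."""
    return beta + gn_direction(beta, y, x, alpha)


def hartley_step(beta, y, x, n_grid=100):
    """Update (4.4.16)-(4.4.17): pick lambda in [0, 1] on a grid to
    minimize S along the Gauss-Newton direction."""
    delta = gn_direction(beta, y, x)
    lam = min((k / n_grid for k in range(n_grid + 1)),
              key=lambda l: S(beta + l * delta, y, x))
    return beta + lam * delta


def iterate(step, beta, y, x, n_iter=60):
    for _ in range(n_iter):
        beta = step(beta, y, x)
    return beta


x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [f(2.0, xt) for xt in x]          # noise-free data, true beta = 2
beta_marquardt = iterate(marquardt_step, 1.0, y, x)
beta_hartley = iterate(hartley_step, 1.0, y, x)
```

Because $\lambda_1 = 0$ is always admissible in the grid, a step of the
Hartley type can never increase $S$; the Marquardt step is shorter than the
Gauss-Newton step and, in this example, converges more slowly to the same
value.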

As in the Newton-Raphson method, it can be shown that the second-round
estimator of the Gauss-Newton iteration is asymptotically as efficient as
NLLS if the iteration is started from an estimator $\hat\beta_1$ such that
$\sqrt{T}(\hat\beta_1 - \beta_0)$ converges to a nondegenerate random variable.

Finally, we want to mention several empirical papers in which the Gauss- 
Newton iteration and related iterative methods have been used. Bodkin and 
Klein (1967) estimated Cobb-Douglas and CES production functions by the 
Newton-Raphson method. Charatsis (1971) estimated the CES production 
function by a modification of the Gauss-Newton method similar to that of 
Hartley (1961) and found that in 64 out of 74 samples it converged within six 
iterations. Mizon (1977), in a paper whose major aim was to choose among 
nine production functions including Cobb-Douglas and CES, used the conju- 
gate gradient method of Powell (1964) (see Quandt, 1983). Mizon’s article also 
contained interesting econometric applications of various statistical tech- 
niques we shall discuss in Section 4.5, namely, a comparison of the likelihood 
ratio and related tests, Akaike information criterion, tests of separate families 
of hypotheses, and the Box-Cox transformation (Section 8.1.2). Sargent 
(1978) estimated a rational expectations model (which gives rise to nonlinear 
constraints among parameters) by the DFP algorithm. 


4.5 Asymptotic Tests and Related Topics 
4.5.1 Likelihood Ratio and Related Tests 


Let $L(\mathbf{x}, \theta)$ be the joint density of a $T$-vector of random variables
$\mathbf{x} = (x_1, x_2, \ldots, x_T)'$ characterized by a $K$-vector of parameters $\theta$. We assume
all the conditions used to prove the asymptotic normality (4.2.23) of the
maximum likelihood estimator $\hat\theta$. In this section we shall discuss the
asymptotic tests of the hypothesis

\[
h(\theta) = 0,
\tag{4.5.1}
\]

where $h$ is a $q$-vector valued differentiable function with $q < K$. We assume
that (4.5.1) can be equivalently written as

\[
\theta = r(\alpha),
\tag{4.5.2}
\]

where $\alpha$ is a $p$-vector of parameters such that $p = K - q$. We denote the
constrained maximum likelihood estimator subject to (4.5.1) or (4.5.2) as
$\tilde\theta = r(\hat\alpha)$.

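Three asymptotic tests of (4.5.1) are well known; they are the likelihood
ratio test (LRT), Wald's test (Wald, 1943), and Rao's score test (Rao, 1947).
The definitions of their respective test statistics are

\[
\text{LRT} = -2\log\frac{\max_{h(\theta)=0} L(\theta)}{\max_{\theta} L(\theta)}
= 2[\log L(\hat\theta) - \log L(\tilde\theta)],
\tag{4.5.3}
\]

\[
\text{Wald} = -h(\hat\theta)'\left\{\left.\frac{\partial h}{\partial\theta'}\right|_{\hat\theta}
\left[\left.\frac{\partial^2\log L}{\partial\theta\,\partial\theta'}\right|_{\hat\theta}\right]^{-1}
\left.\frac{\partial h'}{\partial\theta}\right|_{\hat\theta}\right\}^{-1} h(\hat\theta),
\tag{4.5.4}
\]

\[
\text{Rao} = -\left.\frac{\partial\log L}{\partial\theta'}\right|_{\tilde\theta}
\left[\left.\frac{\partial^2\log L}{\partial\theta\,\partial\theta'}\right|_{\tilde\theta}\right]^{-1}
\left.\frac{\partial\log L}{\partial\theta}\right|_{\tilde\theta}.
\tag{4.5.5}
\]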

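Maximization of $\log L$ subject to the constraint (4.5.1) is accomplished by
setting the derivative of $\log L - \lambda' h(\theta)$ with respect to $\theta$ and $\lambda$ to 0, where $\lambda$ is
the vector of Lagrange multipliers. Let the solutions be $\tilde\theta$ and $\tilde\lambda$. Then they
satisfy

\[
\left.\frac{\partial\log L}{\partial\theta}\right|_{\tilde\theta}
= \left.\frac{\partial h'}{\partial\theta}\right|_{\tilde\theta}\tilde\lambda.
\]

Inserting this equation into the right-hand side of (4.5.5) yields
$\text{Rao} = -\tilde\lambda' B \tilde\lambda$, where

\[
B = \left.\frac{\partial h}{\partial\theta'}\right|_{\tilde\theta}
\left[\left.\frac{\partial^2\log L}{\partial\theta\,\partial\theta'}\right|_{\tilde\theta}\right]^{-1}
\left.\frac{\partial h'}{\partial\theta}\right|_{\tilde\theta}.
\]

Silvey (1959) showed that $-B^{-1}$ is the asymptotic variance-covariance matrix of $\tilde\lambda$
and hence called Rao's test the Lagrange multiplier test. For a more thorough
discussion of the three tests, see Engle (1984).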

All three test statistics can be shown to have the same limit distribution,
$\chi^2(q)$, under the null hypothesis. In Wald and Rao, $\partial^2\log L/\partial\theta\,\partial\theta'$ can be
replaced with $T\,\text{plim}\,T^{-1}\partial^2\log L/\partial\theta\,\partial\theta'$ without affecting the limit
distribution. In each test the hypothesis (4.5.1) is to be rejected when the value of the
test statistic is large.

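We shall prove $\text{LRT} \to \chi^2(q)$. By a Taylor expansion we have

\[
\log L(\theta_0) = \log L(\hat\theta)
+ \left.\frac{\partial\log L}{\partial\theta'}\right|_{\hat\theta}(\theta_0 - \hat\theta)
+ \frac{1}{2}(\theta_0 - \hat\theta)'
\left.\frac{\partial^2\log L}{\partial\theta\,\partial\theta'}\right|_{\theta^*}(\theta_0 - \hat\theta),
\tag{4.5.6}
\]

where $\theta^*$ lies between $\theta_0$ and $\hat\theta$. Noting that the second term of the right-hand
side of (4.5.6) is 0 by the definition of $\hat\theta$, we have

\[
\log L(\hat\theta) - \log L(\theta_0) \overset{\mathrm{LD}}{=}
\frac{1}{2}T(\hat\theta - \theta_0)'\mathcal{J}_\theta(\hat\theta - \theta_0),
\tag{4.5.7}
\]

where we have defined

\[
\mathcal{J}_\theta = -\lim E\,\frac{1}{T}
\left.\frac{\partial^2\log L}{\partial\theta\,\partial\theta'}\right|_{\theta_0}.
\tag{4.5.8}
\]

Treating $L[r(\alpha)] \equiv L(\alpha)$ as a function of $\alpha$, we similarly obtain

\[
\log L(\hat\alpha) - \log L(\alpha_0) \overset{\mathrm{LD}}{=}
\frac{1}{2}T(\hat\alpha - \alpha_0)'\mathcal{J}_\alpha(\hat\alpha - \alpha_0),
\tag{4.5.9}
\]

where

\[
\mathcal{J}_\alpha = -\lim E\,\frac{1}{T}
\left.\frac{\partial^2\log L}{\partial\alpha\,\partial\alpha'}\right|_{\alpha_0}.
\tag{4.5.10}
\]

Noting $L(\theta_0) = L(\alpha_0)$, we have from (4.5.3), (4.5.7), and (4.5.9)

\[
\text{LRT} \overset{\mathrm{LD}}{=}
T(\hat\theta - \theta_0)'\mathcal{J}_\theta(\hat\theta - \theta_0)
- T(\hat\alpha - \alpha_0)'\mathcal{J}_\alpha(\hat\alpha - \alpha_0).
\tag{4.5.11}
\]

But from Theorem 4.1.3 and its proof we have

\[
\sqrt{T}(\hat\theta - \theta_0) \overset{\mathrm{LD}}{=}
\mathcal{J}_\theta^{-1}\frac{1}{\sqrt{T}}\left.\frac{\partial\log L}{\partial\theta}\right|_{\theta_0}
\tag{4.5.12}
\]

and

\[
\sqrt{T}(\hat\alpha - \alpha_0) \overset{\mathrm{LD}}{=}
\mathcal{J}_\alpha^{-1}\frac{1}{\sqrt{T}}\left.\frac{\partial\log L}{\partial\alpha}\right|_{\alpha_0}.
\tag{4.5.13}
\]

Since

\[
\frac{1}{\sqrt{T}}\left.\frac{\partial\log L}{\partial\alpha}\right|_{\alpha_0}
= R'\frac{1}{\sqrt{T}}\left.\frac{\partial\log L}{\partial\theta}\right|_{\theta_0},
\tag{4.5.14}
\]

where $R = (\partial r/\partial\alpha')_{\alpha_0}$, and

\[
\frac{1}{\sqrt{T}}\left.\frac{\partial\log L}{\partial\theta}\right|_{\theta_0} \to N(0, \mathcal{J}_\theta),
\tag{4.5.15}
\]

we have from (4.5.11)-(4.5.15)

\[
\text{LRT} \overset{\mathrm{LD}}{=} u'(\mathcal{J}_\theta^{-1} - R\mathcal{J}_\alpha^{-1}R')u,
\tag{4.5.16}
\]

where $u \sim N(0, \mathcal{J}_\theta)$. Finally, defining

\[
\epsilon = \mathcal{J}_\theta^{-1/2}u \sim N(0, I),
\tag{4.5.17}
\]

we obtain

\[
\text{LRT} \overset{\mathrm{LD}}{=}
\epsilon'(I - \mathcal{J}_\theta^{1/2}R\mathcal{J}_\alpha^{-1}R'\mathcal{J}_\theta^{1/2})\epsilon.
\tag{4.5.18}
\]

But, because

\[
\mathcal{J}_\alpha = R'\mathcal{J}_\theta R,
\tag{4.5.19}
\]

$I - \mathcal{J}_\theta^{1/2}R\mathcal{J}_\alpha^{-1}R'\mathcal{J}_\theta^{1/2}$ can be easily shown to be an idempotent matrix of rank $q$.
Therefore, by Theorem 2 of Appendix 2, $\text{LRT} \to \chi^2(q)$.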
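
The idempotency claim can be checked directly. The following short derivation is a sketch that uses only the relation $\mathcal{J}_\alpha = R'\mathcal{J}_\theta R$ of (4.5.19); the symbol $P$ is introduced here for convenience and does not appear in the text.

\[
\begin{aligned}
P &\equiv \mathcal{J}_\theta^{1/2}R\mathcal{J}_\alpha^{-1}R'\mathcal{J}_\theta^{1/2}, \qquad P' = P,\\
P^2 &= \mathcal{J}_\theta^{1/2}R\mathcal{J}_\alpha^{-1}(R'\mathcal{J}_\theta R)\mathcal{J}_\alpha^{-1}R'\mathcal{J}_\theta^{1/2}
     = \mathcal{J}_\theta^{1/2}R\mathcal{J}_\alpha^{-1}R'\mathcal{J}_\theta^{1/2} = P,\\
(I - P)^2 &= I - 2P + P^2 = I - P,\\
\operatorname{rank}(I - P) &= \operatorname{tr}(I - P) = K - \operatorname{tr}(\mathcal{J}_\alpha^{-1}R'\mathcal{J}_\theta R) = K - p = q.
\end{aligned}
\]

The rank of a symmetric idempotent matrix equals its trace, which gives the last line.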

The proofs of $\text{Wald} \to \chi^2(q)$ and $\text{Rao} \to \chi^2(q)$ are omitted; the former is very
easy and the latter is as involved as the preceding proof.

Next we shall find explicit formulae for the three tests (4.5.3), (4.5.4), and
(4.5.5) for the nonlinear regression model (4.3.1) when the error $u$ is normal.
Let $\hat\beta$ be the NLLS estimator of $\beta_0$ and let $\tilde\beta$ be the constrained NLLS, that is,
the value of $\beta$ that minimizes (4.3.5) subject to the constraint $h(\beta) = 0$. Also,
define $\hat G = (\partial f/\partial\beta')_{\hat\beta}$ and $\tilde G = (\partial f/\partial\beta')_{\tilde\beta}$. Then the three test statistics are
defined as

\[
\text{LRT} = T[\log T^{-1}S_T(\tilde\beta) - \log T^{-1}S_T(\hat\beta)],
\tag{4.5.20}
\]

\[
\text{Wald} = \frac{T\,h(\hat\beta)'\left[\left.\dfrac{\partial h}{\partial\beta'}\right|_{\hat\beta}
(\hat G'\hat G)^{-1}\left.\dfrac{\partial h'}{\partial\beta}\right|_{\hat\beta}\right]^{-1}h(\hat\beta)}{S_T(\hat\beta)},
\tag{4.5.21}
\]

and

\[
\text{Rao} = \frac{T[y - f(\tilde\beta)]'\tilde G(\tilde G'\tilde G)^{-1}\tilde G'[y - f(\tilde\beta)]}{S_T(\tilde\beta)}.
\tag{4.5.22}
\]

Because (4.5.20), (4.5.21), and (4.5.22) are special cases of (4.5.3), (4.5.4), and
(4.5.5), all three statistics are asymptotically distributed as $\chi^2(q)$ under the null
hypothesis if $u$ is normal. Furthermore, we can show that statistics (4.5.20),
(4.5.21), and (4.5.22) are asymptotically distributed as $\chi^2(q)$ under the null
even if $u$ is not normal. Thus these statistics can be used to test a nonlinear
hypothesis under a nonnormal situation.

In the linear model with linear hypothesis $Q'\beta = 0$, statistics (4.5.20)-(4.5.22)
are further reduced to

\[
\text{LRT} = T\log[S_T(\tilde\beta)/S_T(\hat\beta)],
\tag{4.5.23}
\]

\[
\text{Wald} = T[S_T(\tilde\beta) - S_T(\hat\beta)]/S_T(\hat\beta),
\tag{4.5.24}
\]

and

\[
\text{Rao} = T[S_T(\tilde\beta) - S_T(\hat\beta)]/S_T(\tilde\beta).
\tag{4.5.25}
\]

Thus we can easily show $\text{Wald} \ge \text{LRT} \ge \text{Rao}$. The inequalities hold also in
the multiequation linear model, as shown by Berndt and Savin (1977). Although
the inequalities do not always hold for the nonlinear model, Mizon
(1977) found $\text{Wald} \ge \text{LRT}$ most of the time in his samples.
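
The ordering of the three statistics in the linear model can be illustrated numerically. The sketch below assumes a one-regressor model $y_t = \beta x_t + u_t$ with the hypothesis $\beta = 0$, so that $S_T(\tilde\beta) = \sum y_t^2$; the simulated data, the seed, and the variable names are hypothetical choices for this example. The inequalities follow from $x \ge \log(1 + x) \ge x/(1 + x)$ for $x \ge 0$, applied to $x = [S_T(\tilde\beta) - S_T(\hat\beta)]/S_T(\hat\beta)$.

```python
import math
import random

random.seed(0)
T = 50
x = [random.gauss(0.0, 1.0) for _ in range(T)]
y = [0.3 * xi + random.gauss(0.0, 1.0) for xi in x]

# Unconstrained least squares and its sum of squared residuals
b_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
S_hat = sum((yi - b_hat * xi) ** 2 for xi, yi in zip(x, y))

# Constrained estimate under beta = 0, so the residuals are y itself
S_tilde = sum(yi * yi for yi in y)

lrt = T * math.log(S_tilde / S_hat)      # (4.5.23)
wald = T * (S_tilde - S_hat) / S_hat     # (4.5.24)
rao = T * (S_tilde - S_hat) / S_tilde    # (4.5.25)
```

All three statistics would be compared with the same $\chi^2(1)$ critical value, so in a finite sample the Wald test rejects at least as often as the LRT, which in turn rejects at least as often as Rao's test.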

Gallant and Holly (1980) obtained the asymptotic distribution of the three
statistics under local alternative hypotheses in a nonlinear simultaneous
equations model. Translated into the nonlinear regression model, their results
can be stated as follows: If there exists a sequence of true values $\{\beta_0^T\}$ such that
$\lim \beta_0^T = \beta_0$ and $\delta \equiv \lim T^{1/2}(\beta_0^T - \text{plim}\,\tilde\beta)$ is finite, statistics (4.5.20),
(4.5.21), and (4.5.22) converge to chi-square with $q$ degrees of freedom and
noncentrality parameter $\lambda$, where

\[
\lambda = \sigma_0^{-2}\,\delta'\left.\frac{\partial h'}{\partial\beta}\right|_{\beta_0}
\left[\left.\frac{\partial h}{\partial\beta'}\right|_{\beta_0}(G'G)^{-1}
\left.\frac{\partial h'}{\partial\beta}\right|_{\beta_0}\right]^{-1}
\left.\frac{\partial h}{\partial\beta'}\right|_{\beta_0}\delta.
\tag{4.5.26}
\]

Note that if $\zeta$ is distributed as a $q$-vector $N(0, V)$, then $(\zeta + \mu)'V^{-1}(\zeta + \mu)$ is
distributed as chi-square with $q$ degrees of freedom and noncentrality
parameter $\mu'V^{-1}\mu$. In other words, the asymptotic local power of the tests
based on the three statistics is the same.

There appear to be only a few studies of the small sample properties of the 
three tests, some of which are quoted in Breusch and Pagan (1980). No 
clear-cut ranking of the tests emerged from these studies. 

A generalization of the Wald statistic can be used to test the hypothesis
(4.5.1), even in a situation where the likelihood function is unspecified, as long
as an asymptotically normal estimator $\hat\beta$ of $\beta$ is available. Suppose $\hat\beta$ is
asymptotically distributed as $N(\beta, V)$ under the null hypothesis, with $V$
estimated consistently by $\hat V$. Then the generalized Wald statistic is defined by

\[
\text{G.W.} = h(\hat\beta)'\left[\left.\frac{\partial h}{\partial\beta'}\right|_{\hat\beta}\hat V
\left.\frac{\partial h'}{\partial\beta}\right|_{\hat\beta}\right]^{-1}h(\hat\beta)
\tag{4.5.27}
\]

and is asymptotically distributed as $\chi^2(q)$ under the null hypothesis. Note that
(4.5.21) is a special case of (4.5.27).
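
As a small numerical illustration of (4.5.27), the following sketch computes the generalized Wald statistic for a single restriction ($q = 1$) on a two-dimensional parameter. The restriction $h(\beta) = \beta_1\beta_2 - 1$, the estimate, and the variance matrix are hypothetical values chosen only for the example.

```python
def generalized_wald(beta_hat, V_hat):
    # Hypothetical scalar restriction h(beta) = beta1 * beta2 - 1 (q = 1)
    b1, b2 = beta_hat
    h = b1 * b2 - 1.0
    # Row vector dh/dbeta' evaluated at beta_hat
    grad = (b2, b1)
    # Quadratic form (dh/dbeta') V_hat (dh'/dbeta); a scalar because q = 1
    quad = sum(grad[i] * V_hat[i][j] * grad[j]
               for i in range(2) for j in range(2))
    return h * h / quad

beta_hat = (2.0, 0.6)                  # assumed estimate
V_hat = [[0.04, 0.01], [0.01, 0.09]]   # assumed estimate of V
gw = generalized_wald(beta_hat, V_hat)
```

With $q = 1$ the matrix inverse in (4.5.27) reduces to a scalar division; for $q > 1$ a linear solve would replace it. The value of `gw` would be compared with a $\chi^2(1)$ critical value.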

Another related asymptotic test is the specification test of Hausman (1978).
It can be used to test a more general hypothesis than (4.5.1). The only
requirement of the test is that we have an estimator, usually a maximum
likelihood estimator, that is asymptotically efficient under the null hypothesis
but loses consistency under an alternative hypothesis and another estimator
that is asymptotically less efficient than the first under the null hypothesis but
remains consistent under an alternative hypothesis. If we denote the first
estimator by $\hat\theta$ and the second by $\tilde\theta$, the Hausman test statistic is defined by
$(\hat\theta - \tilde\theta)'\hat V^{-1}(\hat\theta - \tilde\theta)$, where $\hat V$ is a consistent estimator of the asymptotic
variance-covariance matrix of $(\hat\theta - \tilde\theta)$. Under the null hypothesis it is
asymptotically distributed as chi-square with degrees of freedom equal to the
dimension of the vector $\theta$.

If we denote the asymptotic variance-covariance matrix by $V$, it is well
known that $V(\hat\theta - \tilde\theta) = V(\tilde\theta) - V(\hat\theta)$. This equality follows from
$V(\hat\theta) = V_{12}$, where $V_{12}$ is the asymptotic covariance between $\hat\theta$ and $\tilde\theta$. To
verify this equality, note that if it did not hold, we could define a new estimator
$\hat\theta + [V(\hat\theta) - V_{12}][V(\hat\theta - \tilde\theta)]^{-1}(\tilde\theta - \hat\theta)$, the asymptotic variance-covariance
matrix of which is $V(\hat\theta) - [V(\hat\theta) - V_{12}][V(\hat\theta - \tilde\theta)]^{-1}[V(\hat\theta) - V_{12}]'$, which is
smaller (in the matrix sense) than $V(\hat\theta)$. But this is a contradiction because $\hat\theta$ is
asymptotically efficient by assumption.
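
The variance identity can be illustrated by simulation. In the sketch below, which is an assumed example and not Hausman's own, the data are i.i.d. standard normal, the efficient estimator $\hat\theta$ is the sample mean, and the less efficient estimator $\tilde\theta$ is the sample median; under normality both estimate the center of the distribution. The seed, the sample size, and the number of replications are arbitrary.

```python
import math
import random

def variance(values):
    # Population-style variance (divide by n)
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

random.seed(2024)
T, reps = 101, 2000
means, medians, diffs = [], [], []
for _ in range(reps):
    ys = [random.gauss(0.0, 1.0) for _ in range(T)]
    mean = sum(ys) / T
    median = sorted(ys)[T // 2]   # middle order statistic, T is odd
    means.append(mean)
    medians.append(median)
    diffs.append(mean - median)

# Variances scaled by T, to compare with asymptotic variances
v_mean = T * variance(means)      # close to 1
v_median = T * variance(medians)  # close to pi/2 for large T
v_diff = T * variance(diffs)      # should be close to v_median - v_mean

# Hausman-type statistic for the last simulated sample, using the
# asymptotic variance (pi/2 - 1) * sigma^2 of sqrt(T) * (mean - median)
s2 = variance(ys)
hausman = T * (mean - median) ** 2 / ((math.pi / 2 - 1.0) * s2)
```

In this scalar case the statistic would be compared with a $\chi^2(1)$ critical value.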


4.5.2 Akaike Information Criterion 


The Akaike information criterion in the context of the linear regression model
was mentioned in Section 2.1.5. Here we shall consider it in a more general
setting. Suppose we want to test the hypothesis (4.5.2) on the basis of the
likelihood ratio test (4.5.3). It means that we choose $L(\theta)$ over $L(\alpha)$ if

\[
\text{LRT} > d,
\tag{4.5.28}
\]

where $d$ is determined so that $P[\text{LRT} > d \mid L(\alpha)] = c$, a certain prescribed
constant such as 5%. However, if we must choose one model out of many
competing models $L(\alpha_1), L(\alpha_2), \ldots$, this classical testing procedure is at
best time consuming and at worst inconclusive. Akaike (1973) proposed a
simple procedure that enables us to solve this kind of model-selection problem.
Here we shall give only a brief account; for the full details the reader is
referred to Akaike's original article or to Amemiya (1980a).

We can write a typical member of the many competing models
$L(\alpha_1), L(\alpha_2), \ldots$ as $L(\alpha)$. Akaike proposed the loss function

\[
W(\theta_0, \hat\alpha) = \frac{2}{T}\int\left[\log\frac{L(\theta_0)}{L(\hat\alpha)}\right]L(\theta_0)\,dx,
\tag{4.5.29}
\]

where $\hat\alpha$ is treated as a constant in the integration. Because $W(\theta_0, \hat\alpha) \ge
W(\theta_0, \theta_0) = 0$, (4.5.29) can serve as a reasonable loss function that is to be
minimized among the competing models. However, because $W$ depends on
the unknown parameters $\theta_0$, a predictor of $W$ must be used instead. After
rather complicated algebraic manipulation, Akaike arrived at the following
simple predictor of $W$, which he called the Akaike Information Criterion
(AIC):

\[
\text{AIC} = -\frac{2}{T}\log L(\hat\alpha) + \frac{2p}{T},
\tag{4.5.30}
\]

where $p$ is the dimension of the vector $\alpha$. The idea is to choose the model for
which AIC is smallest. Akaike's explanation regarding why AIC (plus a certain
omitted term that does not depend on $\hat\alpha$ or $p$) may be regarded as a good
predictor of $W$ is not entirely convincing. Nevertheless, many empirical
researchers think that AIC serves as a satisfactory guideline for selecting a
model.
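
The use of (4.5.30) can be sketched with two competing normal models for an i.i.d. sample. The example is an assumption made for illustration: the restricted model fixes the mean at zero ($p = 1$, variance only), the unrestricted model estimates the mean as well ($p = 2$), and the data are simulated with a nonzero mean. The function names are hypothetical.

```python
import math
import random

def normal_loglik_at_mle(sigma2_hat, T):
    # Maximized normal log likelihood when the variance MLE is sigma2_hat
    return -0.5 * T * (math.log(2.0 * math.pi * sigma2_hat) + 1.0)

def aic(loglik, p, T):
    # AIC = -(2/T) log L(alpha_hat) + 2p/T, as in (4.5.30)
    return -2.0 / T * loglik + 2.0 * p / T

random.seed(7)
T = 100
y = [2.0 + random.gauss(0.0, 1.0) for _ in range(T)]

# Model 1: mean fixed at 0, only the variance is estimated (p = 1)
s2_restricted = sum(v * v for v in y) / T
aic_restricted = aic(normal_loglik_at_mle(s2_restricted, T), 1, T)

# Model 2: mean and variance both estimated (p = 2)
ybar = sum(y) / T
s2_free = sum((v - ybar) ** 2 for v in y) / T
aic_free = aic(normal_loglik_at_mle(s2_free, T), 2, T)

chosen = "free mean" if aic_free < aic_restricted else "zero mean"
```

The extra parameter costs $2/T$ in the criterion, so it is selected only when it raises the maximized log likelihood by more than one unit.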


4.5.3 Tests of Separate Families of Hypotheses 


So far we have considered testing or choosing hypotheses on parameters
within one family of models or likelihood functions. The procedures discussed
in the preceding sections cannot be used to choose between two entirely
different likelihood functions, say $L_f(\theta)$ and $L_g(\gamma)$. For example, this case
arises when we must decide whether a particular sample comes from a lognormal
or a gamma population. Such models, which do not belong to a single
parametric family of models, are called nonnested models.

Suppose we want to test the null hypothesis $L_f$ against the alternative $L_g$.
Cox (1961, 1962) proposed the test statistic

\[
R_f = \log L_f(\hat\theta) - [E_\theta\log L_f(\theta)]_{\hat\theta}
- \log L_g(\hat\gamma) + [E_\theta\log L_g(\gamma_\theta)]_{\hat\theta},
\tag{4.5.31}
\]

where $\gamma_\theta = \text{plim}_\theta\,\hat\gamma$ (meaning the probability limit is taken assuming $L_f(\theta)$ is
the true model) and $\hat{}$ indicates maximum likelihood estimates. We are to
accept $L_f$ if $R_f$ is larger than a critical value determined by the asymptotic
distribution of $R_f$. Cox proved that $R_f$ is asymptotically normal with zero
mean and variance equal to $E(v_f^2) - E(v_f w_f')(E w_f w_f')^{-1}E(v_f w_f)$, where
$v_f = \log L_f(\theta) - \log L_g(\gamma_\theta) - E_\theta[\log L_f(\theta) - \log L_g(\gamma_\theta)]$ and $w_f = \partial\log L_f(\theta)/\partial\theta$.
Amemiya (1973b) has presented a more rigorous derivation of the asymptotic
distribution of $R_f$.

A weakness of Cox's test is its inherent asymmetry; the test of $L_f$ against $L_g$
based on $R_f$ may contradict the test of $L_g$ against $L_f$ based on the analogous
test statistic $R_g$. For example, $L_f$ may be rejected by $R_f$ and at the same time
$L_g$ may be rejected by $R_g$.

Pesaran (1982) compared the power of the Cox test and other related tests.
For other recent references on the subject of nonnested models in general, see
White (1983).


4.6 Least Absolute Deviations Estimator 


The practical importance of the least absolute deviations (LAD) estimator as a
robust estimator was noted in Section 2.3. Besides its practical importance, 
the LAD estimator poses an interesting theoretical problem because the gen- 
eral results of Section 4.1 can be used to prove only the consistency of the LAD 
estimator but not its asymptotic normality, even though the LAD estimator is 
an extremum estimator. In Section 4.6.1 we shall prove the asymptotic nor- 
mality of the median, which is the LAD estimator in the i.i.d. sample case, 
using a method different from the method of Section 4.1.2 and shall point out 
why the latter method fails in this case. In Section 4.6.2 we shall use the general 
results of Section 4.1.1 to prove the consistency of the LAD estimator in a 
regression model. Finally, in Section 4.6.3 we shall indicate what lines of proof 
of asymptotic normality may be used for the LAD estimator in a regression 
model. 

The cases where the general asymptotic results of Section 4.1 cannot be 
wholly used may be referred to as nonregular cases. Besides the LAD estima- 
tor, we have already noted some nonregular cases in Section 4.2.3 and will 
encounter more in Sections 9.5 and 9.6. It is hoped that the methods outlined 
in the present section may prove useful for dealing with unsolved problems in 
other nonregular cases. 


4.6.1 Asymptotic Normality of the Median 


Let $\{Y_t\}$, $t = 1, 2, \ldots, T$, be a sequence of i.i.d. random variables with
common distribution function $F$ and density function $f$. The population median
$M$ is defined by

\[
F(M) = \tfrac{1}{2}.
\tag{4.6.1}
\]

We assume $F$ to be such that $M$ is uniquely determined by (4.6.1), which
follows from assuming $f(y) > 0$ in the neighborhood of $y = M$. We also assume
that $f'(y)$ exists for $y > M$ in a neighborhood of $M$. Define the binary
random variable $W_t(\alpha)$ by

\[
W_t(\alpha) =
\begin{cases}
1 & \text{if } Y_t \ge \alpha\\
0 & \text{if } Y_t < \alpha
\end{cases}
\tag{4.6.2}
\]

for every real number $\alpha$. Using (4.6.2), we define the sample median $m_T$ by

\[
m_T = \inf\left\{\alpha \;\middle|\; \sum_{t=1}^{T} W_t(\alpha) \le \frac{T}{2}\right\}.
\tag{4.6.3}
\]

The median as defined above is clearly unique.¹¹
The asymptotic normality of $m_T$ can be proved in the following manner:
Using (4.6.3), we have for any $y$

\[
P(m_T < M + T^{-1/2}y) = P\left[\sum_{t=1}^{T} W_t(M + T^{-1/2}y) \le \frac{T}{2}\right].
\tag{4.6.4}
\]

Define

\[
P_t = 1 - P(Y_t < M + T^{-1/2}y).
\]

Then, because by a Taylor expansion

\[
P_t = \tfrac{1}{2} - T^{-1/2}f(M)y - O(T^{-1}),
\tag{4.6.5}
\]

we have

\[
P\left(\sum_{t=1}^{T} W_t^* \le \frac{T}{2}\right) = P[Z_T + O(T^{-1/2}) \le f(M)y],
\tag{4.6.6}
\]

where $W_t^* = W_t(M + T^{-1/2}y)$ and $Z_T = T^{-1/2}\sum_{t=1}^{T}(W_t^* - P_t)$. We now derive
the limit distribution of $Z_T$ using the characteristic function (Definition 3.3.1).
We have

\[
\begin{aligned}
E\exp(i\lambda Z_T) &= \prod_{t=1}^{T} E\exp[i\lambda T^{-1/2}(W_t^* - P_t)]\\
&= \prod_{t=1}^{T}\{P_t\exp[i\lambda T^{-1/2}(1 - P_t)] + (1 - P_t)\exp(-i\lambda T^{-1/2}P_t)\}\\
&= \left[1 - \frac{\lambda^2}{8T} + o(T^{-1})\right]^T\\
&\to \exp(-\lambda^2/8),
\end{aligned}
\tag{4.6.7}
\]

where the third equality above is based on (4.6.5) and the expansion of the
exponent: $e^x = 1 + x + 2^{-1}x^2 + \ldots$. Therefore $Z_T \to N(0, 4^{-1})$, which
implies by Theorem 3.2.7

\[
Z_T + O(T^{-1/2}) \to N(0, 4^{-1}).
\tag{4.6.8}
\]

Finally, from (4.6.4), (4.6.6), and (4.6.8), we have proved

\[
\sqrt{T}(m_T - M) \to N[0, 4^{-1}f(M)^{-2}].
\tag{4.6.9}
\]
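
Result (4.6.9) can be checked by a small simulation. The sketch below assumes that $Y_t$ is uniform on $(0, 1)$, so that $M = 1/2$, $f(M) = 1$, and the limit variance is $1/4$. It also assumes one reading of (4.6.3), under which the sample median is the lower middle order statistic; the seed and the simulation sizes are arbitrary choices.

```python
import random

def sample_median(ys):
    # Lower middle order statistic: for T = 2k + 1 this is Y_(k+1),
    # for T = 2k it is Y_(k), which is the infimum described by (4.6.3)
    s = sorted(ys)
    return s[(len(s) - 1) // 2]

random.seed(12345)
T, reps = 201, 2000
draws = []
for _ in range(reps):
    ys = [random.random() for _ in range(T)]          # uniform(0, 1), M = 0.5
    draws.append(T ** 0.5 * (sample_median(ys) - 0.5))

sim_mean = sum(draws) / reps
sim_var = sum((d - sim_mean) ** 2 for d in draws) / reps
limit_var = 1.0 / (4.0 * 1.0 ** 2)                    # [4 f(M)^2]^{-1} with f(M) = 1
```

The simulated variance of $\sqrt{T}(m_T - M)$ should be near `limit_var`, with the gap shrinking as `T` and `reps` grow.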


The consistency of $m_T$ follows from statement (4.6.9). However, it also can
be proved by a direct application of Theorem 4.1.1. Let $\mathfrak{m}_T$ be the set of the $\theta$
points that minimize¹²

\[
S_T = \sum_{t=1}^{T}|Y_t - \theta| - \sum_{t=1}^{T}|Y_t - M|.
\tag{4.6.10}
\]

Then, clearly, $m_T \in \mathfrak{m}_T$. We have

\[
Q \equiv \text{plim}\,T^{-1}S_T = \theta + 2\int_{\theta}^{M}\lambda f(\lambda)\,d\lambda - 2\theta\int_{\theta}^{\infty}f(\lambda)\,d\lambda,
\tag{4.6.11}
\]

where the convergence can be shown to be uniform in $\theta$. The derivation of
(4.6.11) and the uniform convergence will be shown for a regression model,
for which the present i.i.d. model is a special case, in the next subsection.
Because

\[
\frac{\partial Q}{\partial\theta} = -1 + 2F(\theta)
\tag{4.6.12}
\]

and

\[
\frac{\partial^2 Q}{\partial\theta^2} = 2f(\theta),
\tag{4.6.13}
\]

we conclude that $Q$ is uniquely minimized at $\theta = M$ and hence $m_T$ is consistent
by Theorem 4.1.1.

Next, we shall consider two complications that prevent us from proving the
asymptotic normality of the median by using Theorem 4.1.3: One is that
$\partial S_T/\partial\theta = 0$ may have no roots and the other is that $\partial^2 S_T/\partial\theta^2 = 0$ except for a
finite number of points. These statements are illustrated in Figure 4.2, which
depicts two typical shapes of $S_T$.

[Figure 4.2 Complications proving the asymptotic normality of the median. Panel (i): $\partial S_T/\partial\theta = 0$ has roots. Panel (ii): $\partial S_T/\partial\theta = 0$ has no roots.]

Despite these complications, assumption C of Theorem 4.1.3 is still valid if
we interpret the derivative to mean the left derivative. That is to say, define for
$\Delta > 0$

\[
\frac{\partial S_T}{\partial\theta} = \lim_{\Delta\to 0}\frac{S_T(\theta) - S_T(\theta - \Delta)}{\Delta}.
\tag{4.6.14}
\]

Then, from (4.6.10), we obtain

\[
\left.\frac{\partial S_T}{\partial\theta}\right|_{M} = \sum_{t=1}^{T}[1 - 2W_t(M)].
\tag{4.6.15}
\]

Because $\{W_t(M)\}$ are i.i.d. with mean $\tfrac{1}{2}$ and variance $\tfrac{1}{4}$, we have by Theorem
3.3.4 (Lindeberg-Lévy CLT)

\[
\frac{1}{\sqrt{T}}\left.\frac{\partial S_T}{\partial\theta}\right|_{M} \to N(0, 1).
\tag{4.6.16}
\]

Assumption B of Theorem 4.1.3 does not hold because $\partial^2 S_T/\partial\theta^2 = 0$ for
almost every $\theta$. But, if we substitute $[\partial^2 Q/\partial\theta^2]_M$ for $\text{plim}\,T^{-1}[\partial^2 S_T/\partial\theta^2]_M$ in
assumption B of Theorem 4.1.3, the conclusion of the theorem yields exactly
the right result (4.6.9) because of (4.6.13) and (4.6.16).


4.6.2 Consistency of Least Absolute Deviations Estimator


Consider a classical regression model

\[
y = X\beta_0 + u,
\tag{4.6.17}
\]

where $X$ is a $T \times K$ matrix of bounded constants such that $\lim_{T\to\infty} T^{-1}X'X$ is a
finite positive-definite matrix and $u$ is a $T$-vector of i.i.d. random variables
with continuous density function $f(\cdot)$ such that $\int_0^\infty f(\lambda)\,d\lambda = \tfrac{1}{2}$ and $f(x) > 0$
for all $x$ in a neighborhood of 0. It is assumed that the parameter space $B$ is
compact. We also assume that the empirical distribution function of the rows
of $X$, $\{x_t'\}$, converges to a distribution function. The LAD estimator $\hat\beta$ is
defined to be a value of $\beta$ that minimizes

\[
S_T = \sum_{t=1}^{T}|y_t - x_t'\beta| - \sum_{t=1}^{T}|u_t|.
\tag{4.6.18}
\]

This is a generalization of (4.6.10). Like the median, the LAD estimator may
not be unique.
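
In the special case of a single regressor without intercept ($K = 1$, all $x_t \ne 0$), a minimizer of (4.6.18) can be computed directly, because $\sum|y_t - \beta x_t| = \sum|x_t|\,|y_t/x_t - \beta|$ is minimized by a weighted median of the ratios $y_t/x_t$ with weights $|x_t|$. The Python sketch below illustrates this special case only; the function names and the small data set, which contains one outlier, are hypothetical.

```python
def lad_objective(beta, x, y):
    # Sum of absolute deviations; the term sum |u_t| in (4.6.18) does not
    # depend on beta and is omitted
    return sum(abs(yi - beta * xi) for xi, yi in zip(x, y))

def lad_slope(x, y):
    # Weighted median of y_t / x_t with weights |x_t| (requires x_t != 0).
    # Returns the smallest ratio at which the cumulative weight reaches
    # half of the total weight; other minimizers may exist.
    pairs = sorted((yi / xi, abs(xi)) for xi, yi in zip(x, y))
    half = 0.5 * sum(w for _, w in pairs)
    cum = 0.0
    for ratio, w in pairs:
        cum += w
        if cum >= half:
            return ratio

x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 40.0]          # last observation is an outlier
b_lad = lad_slope(x, y)
b_ols = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
```

In this example the objective is flat between two neighboring ratios, so the minimizer is not unique, in line with the remark above. The least squares slope `b_ols` is pulled far from the bulk of the data by the outlier, whereas `b_lad` is not.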

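We shall now prove the consistency of the LAD estimator using Theorem
4.1.1. From (4.6.18) we have

\[
S_T = \sum_{t=1}^{T} h(u_t \mid x_t'\delta),
\tag{4.6.19}
\]

where $\delta = \beta - \beta_0$ and $h(z \mid \alpha)$ is defined as

\[
\begin{aligned}
\text{If } \alpha \ge 0,\quad h(z \mid \alpha) &=
\begin{cases}
\alpha & \text{if } z \le 0\\
\alpha - 2z & \text{if } 0 < z < \alpha\\
-\alpha & \text{if } z \ge \alpha.
\end{cases}\\
\text{If } \alpha < 0,\quad h(z \mid \alpha) &=
\begin{cases}
\alpha & \text{if } z \le \alpha\\
-\alpha + 2z & \text{if } \alpha < z < 0\\
-\alpha & \text{if } z \ge 0.
\end{cases}
\end{aligned}
\tag{4.6.20}
\]

Then $h(z \mid x_t'\delta)$ is a continuous function of $\delta$ uniformly in $t$ and is uniformly
bounded in $t$ and $\delta$ by our assumptions. Therefore $h(u_t \mid x_t'\delta) - Eh(u_t \mid x_t'\delta)$
satisfies the conditions for $g_t(y, \theta)$ in Theorem 4.2.2. Moreover,
$\lim_{T\to\infty} T^{-1}\sum_{t=1}^{T} Eh(u_t \mid x_t'\delta)$ can be shown to exist by Theorem 4.2.3. Therefore:

\[
\begin{aligned}
Q &\equiv \text{plim}\,T^{-1}S_T\\
&= 2\lim\frac{1}{T}\sum_{t=1}^{T}\int_{x_t'\delta}^{0}\lambda f(\lambda)\,d\lambda
- 2\lim\frac{1}{T}\sum_{t=1}^{T}\left[\int_{x_t'\delta}^{\infty}f(\lambda)\,d\lambda\cdot x_t'\delta\right]
+ \lim\frac{1}{T}\sum_{t=1}^{T}x_t'\delta,
\end{aligned}
\tag{4.6.21}
\]

where the convergence of each term is uniform in $\delta$ by Theorem 4.2.3.

Thus it only remains to show that $Q$ attains a global minimum at $\delta = 0$.
Differentiating (4.6.21) with respect to $\delta$ yields

\[
\frac{\partial Q}{\partial\delta} = -2\lim\frac{1}{T}\sum_{t=1}^{T}\left[\int_{x_t'\delta}^{\infty}f(\lambda)\,d\lambda\cdot x_t\right]
+ \lim\frac{1}{T}\sum_{t=1}^{T}x_t,
\tag{4.6.22}
\]

which is equal to 0 at $\delta = 0$ because $\int_0^\infty f(\lambda)\,d\lambda = \tfrac{1}{2}$ by our assumption. Moreover,
because

\[
\frac{\partial^2 Q}{\partial\delta\,\partial\delta'} = 2\lim\frac{1}{T}\sum_{t=1}^{T}f(x_t'\delta)x_t x_t'
\tag{4.6.23}
\]

is positive definite at $\delta = 0$ because of our assumptions, $Q$ attains a local
minimum at $\delta = 0$. Next we shall show that this local minimum is indeed the
global minimum by showing that $\partial Q/\partial\delta \ne 0$ if $\delta \ne 0$. Suppose $\partial Q/\partial\delta = 0$ at
$\delta_1 \ne 0$. Then, evaluating (4.6.22) at $\delta_1$ and premultiplying it by $\delta_1'$, we obtain

\[
\lim\frac{1}{T}\sum_{t=1}^{T}\left[\frac{1}{2} - \int_{x_t'\delta_1}^{\infty}f(\lambda)\,d\lambda\right]x_t'\delta_1 = 0.
\tag{4.6.24}
\]

To show (4.6.24) is a contradiction, let $a_1$, $a_2$, and $M$ be positive real numbers
such that $|x_t'\delta_1| < M$ for all $t$ and $f(\lambda) \ge a_1$ whenever $|\lambda| \le a_2$. Such numbers
exist because of our assumptions. Then we have

\[
\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T}\left[\frac{1}{2} - \int_{x_t'\delta_1}^{\infty}f(\lambda)\,d\lambda\right]x_t'\delta_1
&= \frac{1}{T}\sum_{t=1}^{T}\left|\frac{1}{2} - \int_{x_t'\delta_1}^{\infty}f(\lambda)\,d\lambda\right||x_t'\delta_1|\\
&\ge \frac{a_1 a_2}{TM}\,\delta_1'X'X\delta_1.
\end{aligned}
\tag{4.6.25}
\]

Therefore (4.6.24) is a contradiction because of our assumption that
$\lim T^{-1}X'X$ is positive definite. This completes the proof of the consistency of
the LAD estimator by means of Theorem 4.1.1.


4.6.3 Asymptotic Normality of Least Absolute Deviations Estimator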


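As noted in Section 4.6.1, the asymptotic normality of LAD cannot be proved
by means of Theorem 4.1.3; nor is the proof of Section 4.6.1 easily generalizable
to the regression case. We shall give a brief outline of a proof of asymptotic
normality. The interested reader should refer to Koenker and Bassett (1982)
and the references of that article.¹³

The asymptotic normality of the LAD estimator is based on the following
three fundamental results:

\[
\frac{1}{\sqrt{T}}\sum_{t=1}^{T}x_t\psi(y_t - x_t'\hat\beta) \xrightarrow{\text{a.s.}} 0,
\tag{4.6.26}
\]

where $\psi(x) = \text{sgn}(x)$,

\[
\begin{aligned}
&\frac{1}{\sqrt{T}}\sum_{t=1}^{T}x_t\psi(y_t - x_t'\hat\beta) - \frac{1}{\sqrt{T}}\sum_{t=1}^{T}x_t\psi(u_t)\\
&\quad - \frac{1}{\sqrt{T}}\sum_{t=1}^{T}x_t[E\psi(y_t - x_t'\beta)]_{\hat\beta}
+ \frac{1}{\sqrt{T}}\sum_{t=1}^{T}x_t E\psi(y_t - x_t'\beta_0) \xrightarrow{P} 0,
\end{aligned}
\tag{4.6.27}
\]

and

\[
\begin{aligned}
\frac{1}{\sqrt{T}}\sum_{t=1}^{T}x_t[E\psi(y_t - x_t'\beta)]_{\hat\beta}
&\cong \frac{1}{\sqrt{T}}\sum_{t=1}^{T}x_t E\psi(y_t - x_t'\beta_0)\\
&\quad + \text{plim}\,\frac{1}{T}\sum_{t=1}^{T}x_t\left[\frac{\partial}{\partial\beta'}E\psi(y_t - x_t'\beta)\right]_{\beta_0}\sqrt{T}(\hat\beta - \beta_0).
\end{aligned}
\tag{4.6.28}
\]

These results imply

\[
\sqrt{T}(\hat\beta - \beta_0) \overset{\mathrm{LD}}{=}
-\left\{\lim\frac{1}{T}\sum_{t=1}^{T}x_t\left[\frac{\partial}{\partial\beta'}E\psi(y_t - x_t'\beta)\right]_{\beta_0}\right\}^{-1}
\frac{1}{\sqrt{T}}\sum_{t=1}^{T}x_t\psi(u_t).
\tag{4.6.29}
\]

Noting $E\psi(y_t - x_t'\beta) = 1 - 2F(x_t'\delta)$, $E\psi(u_t) = 0$, and $V\psi(u_t) = 1$, we obtain

\[
\sqrt{T}(\hat\beta - \beta_0) \to N(0,\; 4^{-1}f(0)^{-2}[\lim T^{-1}X'X]^{-1}).
\tag{4.6.30}
\]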

The proof of (4.6.26) is straightforward and is given in Ruppert and Carroll 
(1980, Lemma A2, p. 836). The proof of (4.6.27) is complicated; it follows 
from Lemma 1 (p. 831) and Lemma A3 (p. 836) of Ruppert and Carroll, who 
in turn used results of Bickel (1975). Equation (4.6.28) is a Taylor series 
approximation.


Exercises


1. (Section 4.1.1)
Prove (i) $\Rightarrow$ (ii) $\Rightarrow$ (iii) in Definition 4.1.1.

2. (Section 4.1.1)
In the model of Exercise 11 in Chapter 1, obtain the probability limits of
the two roots of the likelihood equation, assuming $\lim_{T\to\infty} T^{-1}x'x = c$,
where $c$ is a finite, positive constant.

3. (Section 4.1.1)
In the model of Exercise 2 (this chapter), prove the existence of a consistent
root, using Theorem 4.1.2.

4. (Section 4.1.2)
Suppose that $\{X_T\}$ are essentially bounded; that is, for any $\epsilon > 0$, there
exists $M_\epsilon$ such that $P(|X_T| < M_\epsilon) \ge 1 - \epsilon$ for all $T$. Show that if
$\text{plim}_{T\to\infty} Y_T = 0$, then $\text{plim}_{T\to\infty} X_T Y_T = 0$. (This is needed in the proof of
Theorem 4.1.4.)

5. (Section 4.2.2)
Prove (4.2.18) by verifying Condition D given after (4.1.7). Assume for
simplicity that $\sigma^2$ is known. (Proof for the case of unknown $\sigma^2$ is similar
but more complicated.)

6. (Section 4.2.2)
Let $X_{it}$, $i = 1, 2, \ldots, n$, $t = 1, 2, \ldots, T$, be independent with the
distribution $N(\mu_t, \sigma^2)$. Obtain the probability limit of the maximum likelihood
estimator of $\sigma^2$ assuming that $n$ is fixed and $T$ goes to infinity (cf.
Neyman and Scott, 1948).

7. (Section 4.2.3)
Let $\{X_t\}$, $t = 1, 2, \ldots, T$, be i.i.d. with the probability distribution

\[
X_t =
\begin{cases}
1 & \text{with probability } p\\
0 & \text{with probability } 1 - p.
\end{cases}
\]

Prove the consistency and asymptotic normality of the maximum likelihood
estimator using Theorems 4.1.1 and 4.2.4. (The direct method is
much simpler but not to be used here for the sake of an exercise.)

8. (Section 4.2.3)
Prove the asymptotic normality of the consistent root in the model of
Exercise 2 (this chapter).

9. (Section 4.2.3)
Let $\{X_t\}$ be i.i.d. with uniform distribution over $(0, \theta)$. Show that if $\hat\theta$ is
defined by $T^{-1}(T + 1)\max(X_1, X_2, \ldots, X_T)$,

\[
\lim_{T\to\infty} P[T(\hat\theta - \theta) < x] = \exp(\theta^{-1}x - 1) \quad \text{for } x \le \theta.
\]

10. (Section 4.2.3)
Consider the model

\[
y_t = \beta_0 + u_t, \qquad t = 1, 2, \ldots, T,
\]

where $y_t$ and $u_t$ are scalar random variables and $\beta_0$ is a scalar unknown
parameter. If $\{u_t\}$ are i.i.d. with $Eu_t = 0$, $Eu_t^2 = \beta_0^2$, $Eu_t^3 = 0$, and
$Eu_t^4 = m_4$ (note that we do not assume the normality of $u_t$), which of the
following three estimators do you prefer most and why?
(1) $\hat\beta_1 = T^{-1}\sum_{t=1}^{T} y_t$,
(2) $\hat\beta_2$, which maximizes $S = -(T/2)\log\beta^2 - (1/2\beta^2)\sum_{t=1}^{T}(y_t - \beta)^2$,
(3) $\hat\beta_3$ defined as 0.5 times the value of $\beta$ that minimizes

11. (Section 4.2.3)
Derive the asymptotic variance of the estimator of $\beta$ obtained by minimizing
$\sum_{t=1}^{T}(y_t - \beta x_t)^4$, where $y_t$ is independent with the distribution
$N(\beta x_t, \sigma^2)$ and $\lim_{T\to\infty} T^{-1}\sum_{t=1}^{T}x_t^2$ is a finite, positive constant. You may
assume consistency and asymptotic normality. Indicate the additional
assumptions on $x_t$ one needs. Note if $Z \sim N(0, \sigma^2)$, $EZ^{2k} = \sigma^{2k}(2k)!/(2^k k!)$.

12. (Section 4.2.4)
Complete the proof of Example 4.2.4, that is, the derivation of the asymptotic
normality of the superefficient estimator.

13. (Section 4.2.5)
In the model of Example 4.2.3, obtain the asymptotic variance-covariance
matrix of $\hat\beta$ using the concentrated likelihood function in $\beta$.

14. (Section 4.3.2)
What assumptions are needed to prove consistency in Example 4.3.2
using Theorem 4.3.1?

15. (Section 4.3.3)
Prove the asymptotic normality of the NLLS estimator in Example 4.3.1.

16. (Section 4.3.3)
Consider a nonlinear regression model

\[
y_t = (\beta_0 + x_t)^2 + u_t,
\]

where we assume
(A) $\{u_t\}$ are i.i.d. with $Eu_t = 0$ and $Vu_t = \sigma_0^2$.
(B) Parameter space $B = [-\tfrac{1}{2}, \tfrac{1}{2}]$.
(C) $\{x_t\}$ are i.i.d. with the uniform distribution over $[1, 2]$, distributed
independently of $\{u_t\}$. [$EX^r = (r + 1)^{-1}(2^{r+1} - 1)$ for every positive or
negative integer $r$ except $r = -1$. $EX^{-1} = \log 2$.]
Define two estimators of $\beta_0$:
(1) $\hat\beta$ minimizes $S_T(\beta) = \sum_{t=1}^{T}[y_t - (\beta + x_t)^2]^2$ over $B$.
(2) $\tilde\beta$ minimizes $W_T(\beta) = \sum_{t=1}^{T}\{y_t/(\beta + x_t)^2 + \log[(\beta + x_t)^2]\}$ over $B$.
If $\beta_0 = 0$, which of the two estimators do you prefer? Explain your preference
on the basis of asymptotic results.

17. (Section 4.3.5)
Your client wants to test the hypothesis $\alpha + \beta = 1$ in the nonlinear
regression model

\[
Q_t = L_t^{\alpha}K_t^{\beta} + u_t, \qquad t = 1, 2, \ldots, T,
\]

where $L_t$ and $K_t$ are assumed exogenous and $\{u_t\}$ are i.i.d. with $Eu_t = 0$
and $Vu_t = \sigma^2$. Write your answer in such a way that your client can
perform the test by reading your answer without reading anything else,
except that you may assume he can compute linear least squares estimates
and has access to all the statistical tables and knows how to use them.
Assume that your client understands high-school algebra but not calculus
or matrix analysis.

18. (Section 4.4.3)
Prove the asymptotic efficiency of the second-round estimator of the
Gauss-Newton iteration.

19. (Section 4.5.1)
Prove $\text{Wald} \to \chi^2(q)$, where Wald is defined in Eq. (4.5.4).

20. (Section 4.5.1)
Prove $\text{Rao} \to \chi^2(q)$, where Rao is defined in Eq. (4.5.5).

21. (Section 4.5.1)
Show $\text{Wald} \ge \text{LRT} \ge \text{Rao}$, where these statistics are defined by (4.5.23),
(4.5.24), and (4.5.25).

22. (Section 4.6.1)
Consider the regression model $y_t = \beta x_t + u_t$, where $\{x_t\}$ are known constants
such that $\lim_{T\to\infty} T^{-1}\sum_{t=1}^{T}x_t^2$ is a finite positive constant and $\{u_t\}$
satisfy the conditions for $\{Y_t\}$ given in Section 4.6.1. By modifying the
proof of the asymptotic normality of the median given in Section 4.6.1,
prove the asymptotic normality of the estimator of $\beta$ obtained by minimizing
$\sum_{t=1}^{T}|y_t - \beta x_t|$.

23. (Section 4.6.1)
Let $\{X_t\}$ be i.i.d. with a uniform density over $(-\tfrac{1}{2}, \tfrac{1}{2})$ and let $Y$ be the
binary variable taking values $T^{-1/2}$ and $-T^{-1/2}$ with equal probability
and distributed independently of $\{X_t\}$. Define $W_t = 1$ if $X_t + Y \ge 0$ and
$W_t = 0$ if $X_t + Y < 0$. Prove that $T^{-1/2}\sum_{t=1}^{T}(W_t - \tfrac{1}{2})$ converges to a mixture
of $N(1, \tfrac{1}{4})$ and $N(-1, \tfrac{1}{4})$ with equal probability.


5 Time Series Analysis 


Because there are many books concerned solely with time series analysis, this 
chapter is brief; only the most essential topics are considered. The reader who 
wishes to study this topic further should consult Doob (1953) for a rigorous 
probabilistic foundation of time series analysis; Anderson (1971) or Fuller 
(1976) for estimation and large sample theory; Nerlove, Grether, and Car- 
valho (1979) and Harvey (1981a, b), for practical aspects of fitting time series 
by autoregressive and moving-average models; Whittle (1983) for the theory 
of prediction; Granger and Newbold (1977) for the more practical aspects of 
prediction; and Brillinger (1975) for the estimation of the spectral density. 

In Section 5.1 we shall define stationary time series and the autocovariance 
function and spectral density of stationary time series. In Section 5.2 autore- 
gressive models will be defined and their estimation problems discussed. In 
Section 5.3 autoregressive models with moving-average residuals will be de- 
fined. In Section 5.4 we shall discuss the asymptotic properties of the LS and 
ML estimators in the autoregressive model, and in Section 5.5 we shall discuss 
prediction briefly. Finally, in Section 5.6 we shall discuss distributed-lag 
models. 


5.1 Introduction 
5.1.1 Stationary Time Series 


A time series is a sequence of random variables $\{y_t\}$, $t = 0, \pm 1, \pm 2, \ldots$. We
assume $Ey_t = 0$ for every $t$. (If $Ey_t \ne 0$, we must subtract the mean before we
subject it to the analysis of this chapter.) We say a sequence $\{y_t\}$ is strictly
stationary if the joint distribution of any finite subset $y_{t_1}, y_{t_2}, \ldots, y_{t_K}$
depends only on $t_2 - t_1, t_3 - t_1, \ldots, t_K - t_1$ and not on $t_1$. We say a sequence is
weakly stationary if $Ey_t y_s$ depends only on $|t - s|$ and not on $t$. If a process is
strictly stationary and if the autocovariances exist, the process is weakly
stationary.

In Sections 5.2 through 5.4 we shall be concerned only with strictly stationary
time series. The distributed-lag models discussed in Section 5.6 are generally
not stationary in either sense. Time series with trends are not stationary,
and economic time series often exhibit trends. However, this fact does not
diminish the usefulness of Sections 5.2 through 5.4 because a time series may
be analyzed after a trend is removed. A trend may be removed either by direct
subtraction or by differencing. The latter means considering first differences
$\{y_t - y_{t-1}\}$, second differences $\{(y_t - y_{t-1}) - (y_{t-1} - y_{t-2})\}$, and so forth.
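
The effect of differencing on a trend can be seen in a minimal Python sketch. The two deterministic series are assumed examples: first differences turn a linear trend into a constant, and second differences do the same for a quadratic trend.

```python
def difference(series, order=1):
    # Apply the first-difference operator y_t - y_{t-1} `order` times
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

linear = [3.0 + 2.0 * t for t in range(6)]        # linear trend
quadratic = [1.0 + 0.5 * t * t for t in range(6)] # quadratic trend

first = difference(linear)            # constant series equal to the slope
second = difference(quadratic, 2)     # constant series
```

Each round of differencing shortens the series by one observation.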

There are three fundamental ways to analyze a stationary time series. First,
we can specify a model for it, such as an autoregressive model (which we shall
study in Section 5.2) or a combined autoregressive, moving-average model
(which we shall study in Section 5.3). Second, we can examine autocovariances
$Ey_t y_{t+h}$, $h = 0, 1, 2, \ldots$. Third, we can examine the Fourier transform
of autocovariances called spectral density. In Sections 5.1.2 and 5.1.3 we
shall study autocovariances and spectral density.


5.1.2 Autocovariances 


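Define $\gamma_h = Ey_t y_{t+h}$, $h = 0, 1, 2, \ldots$. A sequence $\{\gamma_h\}$ contains important
information about the characteristics of a time series $\{y_t\}$. It is useful to
arrange $\{\gamma_h\}$ as an autocovariance matrix

$$
\Sigma_T = \begin{bmatrix}
\gamma_0 & \gamma_1 & \gamma_2 & \cdots & \gamma_{T-1} \\
\gamma_1 & \gamma_0 & \gamma_1 & \cdots & \gamma_{T-2} \\
\gamma_2 & \gamma_1 & \gamma_0 & \cdots & \gamma_{T-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\gamma_{T-1} & \gamma_{T-2} & \gamma_{T-3} & \cdots & \gamma_0
\end{bmatrix}. \tag{5.1.1}
$$

This matrix is symmetric, its main diagonal line consists only of $\gamma_0$, the next
diagonal lines have only $\gamma_1$, and so on. Such a matrix is called a Toeplitz form.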
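
The Toeplitz structure of (5.1.1) means the whole matrix is determined by its first row. The following is a minimal sketch in plain Python; the function name `autocovariance_matrix` and the sample values are illustrative assumptions, not part of the text.

```python
def autocovariance_matrix(gammas):
    """Build the T x T matrix of (5.1.1) from [gamma_0, ..., gamma_{T-1}]."""
    T = len(gammas)
    # Element (i, j) depends only on |i - j|: this is the Toeplitz property.
    return [[gammas[abs(i - j)] for j in range(T)] for i in range(T)]


# Hypothetical autocovariances gamma_0 = 4, gamma_1 = 2, gamma_2 = 1.
for row in autocovariance_matrix([4.0, 2.0, 1.0]):
    print(row)
```

Storing only the first row is usually enough in practice, because every other element can be looked up by the lag $|i - j|$.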


5.1.3 Spectral Density 


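Spectral density is the Fourier transform of autocovariances defined by

$$
f(\omega) = \sum_{h=-\infty}^{\infty} \gamma_h e^{-ih\omega}, \qquad -\pi \leq \omega \leq \pi, \tag{5.1.2}
$$

provided the right-hand side converges.
Substituting $e^{i\lambda} = \cos \lambda + i \sin \lambda$, we obtain

$$
f(\omega) = \sum_{h=-\infty}^{\infty} \gamma_h [\cos (h\omega) - i \sin (h\omega)] \tag{5.1.3}
$$

$$
= \sum_{h=-\infty}^{\infty} \gamma_h \cos (h\omega),
$$

where the second equality follows from $\gamma_h = \gamma_{-h}$ and $\sin \lambda = -\sin (-\lambda)$.
Therefore spectral density is real and symmetric around $\omega = 0$.
Inverting (5.1.2), we obtain

$$
\gamma_h = (2\pi)^{-1} \int_{-\pi}^{\pi} e^{ih\omega} f(\omega)\, d\omega \tag{5.1.4}
$$

$$
= \pi^{-1} \int_{0}^{\pi} \cos (h\omega) f(\omega)\, d\omega.
$$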


An interesting interpretation of (5.1.4) is possible. Suppose $y_t$ is a linear
combination of cosine and sine waves with random coefficients:

$$
y_t = \sum_{k=1}^{n} [\xi_k \cos (\omega_k t) + \zeta_k \sin (\omega_k t)], \tag{5.1.5}
$$

where $\omega_k = k\pi/n$ and $\{\xi_k\}$ and $\{\zeta_k\}$ are independent of each other and
independent across $k$ with $E\xi_k = E\zeta_k = 0$ and $V\xi_k = V\zeta_k = \sigma_k^2$. Then we have

$$
\gamma_h = \sum_{k=1}^{n} \sigma_k^2 \cos (\omega_k h), \tag{5.1.6}
$$

which is analogous to (5.1.4). Thus a stationary time series can be interpreted
as an infinite sum (actually an integral) of cycles with random coefficients, and
a spectral density as a decomposition of the total variance of $y_t$ into the
variances of the component cycles with various frequencies.
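
The step from (5.1.5) to (5.1.6) is short but not spelled out. The following is a sketch of the intermediate algebra, using only the stated independence and variance assumptions on $\xi_k$ and $\zeta_k$ and the identity $\cos(a - b) = \cos a \cos b + \sin a \sin b$.

```latex
\begin{align*}
\gamma_h = E y_t y_{t+h}
  &= \sum_{k=1}^{n} \Bigl[ E\xi_k^2 \cos(\omega_k t)\cos(\omega_k (t+h))
     + E\zeta_k^2 \sin(\omega_k t)\sin(\omega_k (t+h)) \Bigr] \\
  &\qquad \text{(all cross terms vanish because the coefficients are
     independent with mean zero)} \\
  &= \sum_{k=1}^{n} \sigma_k^2 \bigl[ \cos(\omega_k (t+h))\cos(\omega_k t)
     + \sin(\omega_k (t+h))\sin(\omega_k t) \bigr] \\
  &= \sum_{k=1}^{n} \sigma_k^2 \cos(\omega_k h).
\end{align*}
```

The result does not depend on $t$, which is why the process in (5.1.5) is weakly stationary.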

There is a relationship between the characteristic roots of the covariance
matrix (5.1.1) and the spectral density (5.1.2). The values of the spectral
density $f(\omega)$ evaluated at $T$ equidistant points of $\omega$ in $[-\pi, \pi]$ are
approximately the characteristic roots of $\Sigma_T$ (see Grenander and Szego, 1958, p. 65,
or Amemiya and Fuller, 1967, p. 527).


5.2 Autoregressive Models
5.2.1 First-Order Autoregressive Model

Consider a sequence of random variables $\{y_t\}$, $t = 0, \pm 1, \pm 2, \ldots$, which
follows

$$
y_t = \rho y_{t-1} + \epsilon_t, \tag{5.2.1}
$$

where we assume

ASSUMPTION A. $\{\epsilon_t\}$, $t = 0, \pm 1, \pm 2, \ldots$, are i.i.d. with $E\epsilon_t = 0$ and
$E\epsilon_t^2 = \sigma^2$ and independent of $y_{t-1}, y_{t-2}, \ldots$.

ASSUMPTION B. $|\rho| < 1$.

ASSUMPTION C. $Ey_t = 0$ and $Ey_t y_{t+h} = \gamma_h$ for all $t$. (That is, $\{y_t\}$ are weakly
stationary.)


Model (5.2.1) with Assumptions A, B, and C is called a stationary first-order
autoregressive model, abbreviated as AR(1).
From (5.2.1) we have

$$
y_t = \rho^s y_{t-s} + \sum_{j=0}^{s-1} \rho^j \epsilon_{t-j}. \tag{5.2.2}
$$

But $\lim_{s \to \infty} E(\rho^s y_{t-s})^2 = 0$ because of Assumptions B and C. Therefore we
have

$$
y_t = \sum_{j=0}^{\infty} \rho^j \epsilon_{t-j}, \tag{5.2.3}
$$

which means that the partial summation of the right-hand side converges to $y_t$
in the mean square. The model (5.2.1) with Assumptions A, B, and C is
equivalent to the model (5.2.3) with Assumptions A and B. The latter is called
the moving-average representation of the former.

A quick mechanical way to obtain the moving-average representation
(5.2.3) of (5.2.1) and vice versa is to define the lag operator $L$ such that
$Ly_t = y_{t-1}$, $L^2 y_t = y_{t-2}$, $\ldots$. Then (5.2.1) can be written as

$$
(1 - \rho L) y_t = \epsilon_t, \tag{5.2.4}
$$

where 1 is the identity operator such that $1 y_t = y_t$. Therefore

$$
y_t = (1 - \rho L)^{-1} \epsilon_t = \Bigl( \sum_{j=0}^{\infty} \rho^j L^j \Bigr) \epsilon_t, \tag{5.2.5}
$$

which is (5.2.3).

An AR(1) process can be generated as follows: Define $y_0$ as a random
variable independent of $\epsilon_1, \epsilon_2, \ldots$, with $Ey_0 = 0$ and $Ey_0^2 = \sigma^2/(1 - \rho^2)$.
Then define $y_t$ by (5.2.2) after putting $s = t$.
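
The recipe above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's procedure: the function name `simulate_ar1` is hypothetical, the draws are taken to be normal (the text only fixes the first two moments), and the recursion (5.2.1) is iterated from $y_0$, which is equivalent to evaluating (5.2.2) with $s = t$.

```python
import random


def simulate_ar1(rho, sigma, T, seed=0):
    """Return [y_1, ..., y_T] from y_t = rho * y_{t-1} + eps_t with |rho| < 1."""
    rng = random.Random(seed)
    # y_0 has mean 0 and variance sigma^2 / (1 - rho^2), as the text requires.
    # Drawing y_0 and eps_t from a normal distribution is an assumption.
    y_prev = rng.gauss(0.0, sigma / (1.0 - rho ** 2) ** 0.5)
    path = []
    for _ in range(T):
        # One step of (5.2.1); iterating from y_0 reproduces (5.2.2) with s = t.
        y_prev = rho * y_prev + rng.gauss(0.0, sigma)
        path.append(y_prev)
    return path


path = simulate_ar1(rho=0.5, sigma=1.0, T=100_000, seed=1)
sample_var = sum(v * v for v in path) / len(path)
# The sample second moment should be close to sigma^2 / (1 - rho^2).
print(sample_var, 1.0 / (1.0 - 0.5 ** 2))
```

Starting from a $y_0$ with the stationary variance avoids the burn-in period that would be needed if the recursion were started at an arbitrary value.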

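The autocovariances $\{\gamma_h\}$ can be expressed as functions of $\rho$ and $\sigma^2$ as
follows: Multiplying (5.2.1) with $y_{t-h}$ and taking the expectation yields

$$
\gamma_h = \rho \gamma_{h-1}, \qquad h = 1, 2, \ldots. \tag{5.2.6}
$$

From (5.2.1), $E(y_t - \rho y_{t-1})^2 = E\epsilon_t^2$, so that we have

$$
(1 + \rho^2)\gamma_0 - 2\rho\gamma_1 = \sigma^2. \tag{5.2.7}
$$

Solving (5.2.6) and (5.2.7), we obtain

$$
\gamma_h = \frac{\sigma^2 \rho^h}{1 - \rho^2}, \qquad h = 0, 1, 2, \ldots. \tag{5.2.8}
$$

Note that Assumption C implies $\gamma_{-h} = \gamma_h$.
Arranging the autocovariances in the form of a matrix as in (5.1.1), we
obtain the autocovariance matrix of AR(1),

$$
\Sigma_1 = \frac{\sigma^2}{1 - \rho^2}
\begin{bmatrix}
1 & \rho & \cdots & \rho^{T-1} \\
\rho & 1 & \cdots & \rho^{T-2} \\
\vdots & \vdots & \ddots & \vdots \\
\rho^{T-1} & \rho^{T-2} & \cdots & 1
\end{bmatrix}. \tag{5.2.9}
$$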


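Now let us examine an alternative derivation of $\Sigma_1$ that is useful for deriving
the determinant and the inverse of $\Sigma_1$ and is easily generalizable to
higher-order processes. Define $T$-vectors $y = (y_1, y_2, \ldots, y_T)'$ and
$\epsilon^*_{(1)} = [(1 - \rho^2)^{1/2} y_1, \epsilon_2, \epsilon_3, \ldots, \epsilon_T]'$ and a $T \times T$ matrix

$$
R_1 = \begin{bmatrix}
(1 - \rho^2)^{1/2} & 0 & 0 & \cdots & 0 \\
-\rho & 1 & 0 & \cdots & 0 \\
0 & -\rho & 1 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & -\rho & 1
\end{bmatrix}. \tag{5.2.10}
$$

Then we have

$$
R_1 y = \epsilon^*_{(1)}. \tag{5.2.11}
$$

But, because $E\epsilon^*_{(1)} \epsilon^{*\prime}_{(1)} = \sigma^2 I$, we obtain

$$
\Sigma_1 = \sigma^2 R_1^{-1} (R_1')^{-1}, \tag{5.2.12}
$$

which can be shown to be identical with (5.2.9). Taking the determinant of
both sides of (5.2.12) yields

$$
|\Sigma_1| = \frac{\sigma^{2T}}{1 - \rho^2}. \tag{5.2.13}
$$

Inverting both sides of (5.2.12) yields

$$
\Sigma_1^{-1} = \frac{1}{\sigma^2} R_1' R_1 \tag{5.2.14}
$$

$$
= \frac{1}{\sigma^2}
\begin{bmatrix}
1 & -\rho & 0 & \cdots & 0 \\
-\rho & 1 + \rho^2 & -\rho & \cdots & 0 \\
0 & -\rho & 1 + \rho^2 & \ddots & \vdots \\
\vdots & & \ddots & 1 + \rho^2 & -\rho \\
0 & \cdots & 0 & -\rho & 1
\end{bmatrix}.
$$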
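
The tridiagonal inverse in (5.2.14) can be checked numerically against the covariance matrix (5.2.9). The following sketch uses plain Python lists; the helper names `ar1_cov`, `ar1_cov_inverse`, and `matmul` are illustrative, and $T \geq 2$ is assumed.

```python
def ar1_cov(rho, sigma2, T):
    """Sigma_1 of (5.2.9): element (i, j) is sigma^2 rho^|i-j| / (1 - rho^2)."""
    scale = sigma2 / (1.0 - rho ** 2)
    return [[scale * rho ** abs(i - j) for j in range(T)] for i in range(T)]


def ar1_cov_inverse(rho, sigma2, T):
    """Tridiagonal inverse of (5.2.14); assumes T >= 2."""
    inv = [[0.0] * T for _ in range(T)]
    for i in range(T):
        # Corner elements are 1, interior diagonal elements are 1 + rho^2.
        inv[i][i] = 1.0 if i in (0, T - 1) else 1.0 + rho ** 2
        if i + 1 < T:
            inv[i][i + 1] = inv[i + 1][i] = -rho
    return [[v / sigma2 for v in row] for row in inv]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


product = matmul(ar1_cov_inverse(0.6, 2.0, 5), ar1_cov(0.6, 2.0, 5))
for row in product:
    print([round(v, 10) for v in row])  # rows of the identity matrix
```

Because the inverse is tridiagonal, quadratic forms such as $y'\Sigma_1^{-1}y$ can be evaluated in $O(T)$ operations without forming or inverting $\Sigma_1$.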


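By inserting (5.2.8) into (5.1.2), we can derive the spectral density of AR(1):

$$
f(\omega) = \frac{\sigma^2}{1 - \rho^2} \sum_{h=-\infty}^{\infty} \rho^{|h|} e^{-ih\omega} \tag{5.2.15}
$$

$$
= \frac{\sigma^2}{1 - \rho^2} \Bigl[ 1 + \sum_{h=1}^{\infty} (\rho e^{i\omega})^h + \sum_{h=1}^{\infty} (\rho e^{-i\omega})^h \Bigr]
$$

$$
= \frac{\sigma^2}{1 - 2\rho \cos \omega + \rho^2}, \qquad -\pi \leq \omega \leq \pi.
$$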

5.2.2 Second-Order Autoregressive Model 


A stationary second-order autoregressive model, abbreviated as AR(2), is
defined by

$$
y_t = \rho_1 y_{t-1} + \rho_2 y_{t-2} + \epsilon_t, \qquad t = 0, \pm 1, \pm 2, \ldots, \tag{5.2.16}
$$

where we assume Assumptions A, C, and

ASSUMPTION B'. The roots of $z^2 - \rho_1 z - \rho_2 = 0$ lie inside the unit circle.

Using the lag operator defined in Section 5.2.1, we can write (5.2.16) as

$$
(1 - \rho_1 L - \rho_2 L^2) y_t = \epsilon_t. \tag{5.2.17}
$$

Hence,

$$
(1 - \mu_1 L)(1 - \mu_2 L) y_t = \epsilon_t, \tag{5.2.18}
$$

where $\mu_1$ and $\mu_2$ are the roots of $z^2 - \rho_1 z - \rho_2 = 0$. Premultiplying (5.2.18) by
$(1 - \mu_1 L)^{-1} (1 - \mu_2 L)^{-1}$, we obtain

$$
y_t = \Bigl( \sum_{j=0}^{\infty} \mu_1^j L^j \Bigr) \Bigl( \sum_{k=0}^{\infty} \mu_2^k L^k \Bigr) \epsilon_t
    = \sum_{j,k=0}^{\infty} \mu_1^j \mu_2^k \epsilon_{t-j-k}. \tag{5.2.19}
$$

Convergence in the mean-square sense of (5.2.19) is ensured by Assumption
B'. Note that even if $\mu_1$ and $\mu_2$ are complex, the coefficients on the $\epsilon_{t-s}$ are
always real.

The values of $\rho_1$ and $\rho_2$ for which the condition $|\mu_1|, |\mu_2| < 1$ is satisfied
correspond to the inner region of the largest triangle in Figure 5.1. In the
region above the parabola, the roots are real, whereas in the region below it,
they are complex.
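
Assumption B' can be checked directly by computing the roots of $z^2 - \rho_1 z - \rho_2 = 0$. The sketch below does this with the quadratic formula and compares the result with the three inequalities $\rho_1 + \rho_2 < 1$, $\rho_2 - \rho_1 < 1$, $\rho_2 > -1$. These inequalities are the standard description of the stationarity triangle and are stated here as an assumption, since the text refers only to Figure 5.1. The function names are illustrative.

```python
import cmath


def ar2_roots(rho1, rho2):
    """Roots mu_1, mu_2 of z^2 - rho1 * z - rho2 = 0 (possibly complex)."""
    disc = cmath.sqrt(rho1 ** 2 + 4.0 * rho2)
    return (rho1 + disc) / 2.0, (rho1 - disc) / 2.0


def is_stationary_ar2(rho1, rho2):
    """Assumption B': both roots lie strictly inside the unit circle."""
    mu1, mu2 = ar2_roots(rho1, rho2)
    return abs(mu1) < 1.0 and abs(mu2) < 1.0


def in_triangle(rho1, rho2):
    """Assumed triangle conditions for the stationarity region."""
    return rho1 + rho2 < 1.0 and rho2 - rho1 < 1.0 and rho2 > -1.0


def roots_are_real(rho1, rho2):
    """Real roots occur on or above the parabola rho2 = -rho1^2 / 4."""
    return rho1 ** 2 + 4.0 * rho2 >= 0.0


for rho1, rho2 in [(0.5, 0.3), (0.5, 0.6), (1.0, -0.5), (0.0, -1.2)]:
    print(rho1, rho2, is_stationary_ar2(rho1, rho2),
          in_triangle(rho1, rho2), roots_are_real(rho1, rho2))
```

The pair $(1.0, -0.5)$ lies inside the triangle but below the parabola, so the roots are complex and the autocovariances oscillate.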

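The autocovariances may be obtained as follows: Multiplying (5.2.16) by
$y_{t-1}$ and taking the expectation, we obtain

$$
\gamma_1 = \rho_1 \gamma_0 + \rho_2 \gamma_1. \tag{5.2.20}
$$

Squaring each side of (5.2.16) and taking the expectation, we obtain

$$
\gamma_0 = (\rho_1^2 + \rho_2^2)\gamma_0 + 2\rho_1 \rho_2 \gamma_1 + \sigma^2. \tag{5.2.21}
$$


Figure 5.1 Regions of the coefficients in AR(2)

Solving (5.2.20) and (5.2.21) for $\gamma_0$ and $\gamma_1$, we obtain

$$
\gamma_0 = \frac{\sigma^2 (\rho_2 - 1)}{(1 + \rho_2)[\rho_1^2 - (1 - \rho_2)^2]} \tag{5.2.22}
$$

and

$$
\gamma_1 = \frac{-\sigma^2 \rho_1}{(1 + \rho_2)[\rho_1^2 - (1 - \rho_2)^2]}. \tag{5.2.23}
$$

Next, multiplying (5.2.16) by $y_{t-h}$ and taking the expectation, we obtain

$$
\gamma_h = \rho_1 \gamma_{h-1} + \rho_2 \gamma_{h-2}, \qquad h \geq 2. \tag{5.2.24}
$$

Note that $\{\gamma_h\}$ satisfy the same difference equation as $\{y_t\}$ except for the
random part. This is also true for a higher-order autoregressive process. Thus,
given the initial conditions (5.2.22) and (5.2.23), the second-order difference
equation (5.2.24) can be solved (see Goldberg, 1958) as

$$
\gamma_h =
\begin{cases}
\dfrac{\mu_1^h}{\mu_1 - \mu_2} (\gamma_1 - \mu_2 \gamma_0) - \dfrac{\mu_2^h}{\mu_1 - \mu_2} (\gamma_1 - \mu_1 \gamma_0) & \text{if } \mu_1 \neq \mu_2, \\[2ex]
\mu^{h-1} [h(\gamma_1 - \mu \gamma_0) + \mu \gamma_0] & \text{if } \mu_1 = \mu_2 = \mu.
\end{cases} \tag{5.2.25}
$$

If $\mu_1$ and $\mu_2$ are complex, (5.2.25) may be rewritten as

$$
\gamma_h = \frac{r^{h-1} [\gamma_1 \sin h\theta - \gamma_0 r \sin (h - 1)\theta]}{\sin \theta}, \tag{5.2.26}
$$

where $\mu_1 = re^{i\theta}$ and $\mu_2 = re^{-i\theta}$.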
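
The AR(2) autocovariances can be computed in two ways that should agree: by iterating the difference equation (5.2.24) from the initial values (5.2.22) and (5.2.23), or by the closed form (5.2.25) for distinct roots. The sketch below does both. The function names are illustrative, and the closed-form routine assumes $\mu_1 \neq \mu_2$.

```python
import cmath


def ar2_autocov(rho1, rho2, sigma2, H):
    """[gamma_0, ..., gamma_H] (H >= 1) from (5.2.22), (5.2.23) and (5.2.24)."""
    denom = (1.0 + rho2) * (rho1 ** 2 - (1.0 - rho2) ** 2)
    gamma = [sigma2 * (rho2 - 1.0) / denom, -sigma2 * rho1 / denom]
    for h in range(2, H + 1):
        gamma.append(rho1 * gamma[h - 1] + rho2 * gamma[h - 2])
    return gamma


def ar2_autocov_closed_form(rho1, rho2, sigma2, h):
    """gamma_h from the first line of (5.2.25); assumes distinct roots."""
    g0, g1 = ar2_autocov(rho1, rho2, sigma2, 1)
    disc = cmath.sqrt(rho1 ** 2 + 4.0 * rho2)
    mu1, mu2 = (rho1 + disc) / 2.0, (rho1 - disc) / 2.0
    value = (mu1 ** h * (g1 - mu2 * g0) - mu2 ** h * (g1 - mu1 * g0)) / (mu1 - mu2)
    # The imaginary parts cancel even when the roots are complex.
    return value.real


gammas = ar2_autocov(1.0, -0.5, 1.0, 6)
for h, g in enumerate(gammas):
    print(h, g, ar2_autocov_closed_form(1.0, -0.5, 1.0, h))
```

With $\rho_1 = 1$ and $\rho_2 = -0.5$ the roots are complex, so the printed autocovariances follow the damped sine wave described by (5.2.26).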

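Arranging the autocovariances given in (5.2.25) in the form of (5.1.1) yields
the autocovariance matrix of AR(2), denoted $\Sigma_2$. We shall not write it
explicitly; instead, we shall express it as a function of a transformation
analogous to (5.2.10). If we define a $T$-vector
$\epsilon^*_{(2)} = (a_1 y_1, a_2 y_1 + a_3 y_2, \epsilon_3, \epsilon_4, \ldots, \epsilon_T)'$ and a $T \times T$ matrix

$$
R_2 = \begin{bmatrix}
a_1 & 0 & 0 & 0 & \cdots & 0 \\
a_2 & a_3 & 0 & 0 & \cdots & 0 \\
-\rho_2 & -\rho_1 & 1 & 0 & \cdots & 0 \\
0 & -\rho_2 & -\rho_1 & 1 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & -\rho_2 & -\rho_1 & 1
\end{bmatrix}, \tag{5.2.27}
$$

we have

$$
R_2 y = \epsilon^*_{(2)}. \tag{5.2.28}
$$

Now, if we determine $a_1$, $a_2$, and $a_3$ by solving $V(a_1 y_1) = \sigma^2$,
$V(a_2 y_1 + a_3 y_2) = \sigma^2$, and $E[a_1 y_1 (a_2 y_1 + a_3 y_2)] = 0$, we have
$E\epsilon^*_{(2)} \epsilon^{*\prime}_{(2)} = \sigma^2 I$. Therefore we obtain from (5.2.28)

$$
\Sigma_2 = \sigma^2 R_2^{-1} (R_2')^{-1}. \tag{5.2.29}
$$

Higher-order autoregressive processes can be similarly handled.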


5.2.3 pth-Order Autoregressive Model


A stationary pth-order autoregressive process, abbreviated as AR(p), is
defined by

$$
y_t = \sum_{j=1}^{p} \rho_j y_{t-j} + \epsilon_t, \qquad t = 0, \pm 1, \pm 2, \ldots, \tag{5.2.30}
$$

where we assume Assumptions A and C and

ASSUMPTION B''. The roots of $\sum_{j=0}^{p} \rho_j z^{p-j} = 0$, $\rho_0 = -1$, lie inside the unit
circle.


A representation of the $T \times T$ autocovariance matrix $\Sigma_p$ of AR(p) analogous
to (5.2.12) or (5.2.29) is possible. The $j,k$th element ($j, k = 0, 1, \ldots, T - 1$)
of $\Sigma_p^{-1}$ can be shown to be the coefficient on $\xi^j \zeta^k$ in


T-1 p p pal Pp P 
ESQ 2 pU — Y, OU Y ptt? Y pCT7, 
=o io i= j=0 j= 0 
where $\rho_0 = -1$ (see Whittle, 1983, p. 73).

We shall prove a series of theorems concerning the properties of a general 
autoregressive process. Each theorem except Theorem 5.2.4 is stated in such a 
way that its premise is the conclusion of the previous theorem. 


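THEOREM 5.2.1. $\{y_t\}$ defined in (5.2.30) with Assumptions A, B'', and C can
be written as a moving-average process of the form

$$
y_t = \sum_{j=0}^{\infty} \phi_j \epsilon_{t-j}, \qquad \sum_{j=0}^{\infty} |\phi_j| < \infty, \tag{5.2.31}
$$

where $\{\epsilon_t\}$ are i.i.d. with $E\epsilon_t = 0$ and $E\epsilon_t^2 = \sigma^2$.

Proof. From (5.2.30) we have

$$
\Bigl[ \prod_{j=1}^{p} (1 - \mu_j L) \Bigr] y_t = \epsilon_t, \tag{5.2.32}
$$

where $\mu_1, \mu_2, \ldots, \mu_p$ are the roots of $\sum_{j=0}^{p} \rho_j z^{p-j}$. Therefore we have

$$
y_t = \Bigl[ \prod_{j=1}^{p} \Bigl( \sum_{k=0}^{\infty} \mu_j^k L^k \Bigr) \Bigr] \epsilon_t. \tag{5.2.33}
$$

Equating the coefficients of (5.2.31) and (5.2.33), we obtain

$$
\begin{aligned}
\phi_0 &= 1 \\
\phi_1 &= \mu_1 + \mu_2 + \cdots + \mu_p \\
\phi_2 &= \sum_{p \geq i \geq j \geq 1} \mu_i \mu_j \\
&\;\;\vdots \\
\phi_n &= \sum_{p \geq i_1 \geq i_2 \geq \cdots \geq i_n \geq 1} \mu_{i_1} \mu_{i_2} \cdots \mu_{i_n}.
\end{aligned} \tag{5.2.34}
$$

Therefore

$$
\sum_{j=0}^{\infty} |\phi_j| \leq (1 + \mu_M + \mu_M^2 + \cdots)^p = \Bigl( \frac{1}{1 - \mu_M} \Bigr)^p < \infty, \tag{5.2.35}
$$

where $\mu_M = \max [|\mu_1|, |\mu_2|, \ldots, |\mu_p|]$.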
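
In practice the moving-average coefficients $\phi_j$ are rarely computed from the roots as in (5.2.34). An equivalent route, sketched below as an assumption rather than as the method of the proof, equates coefficients of $L^j$ in $(1 - \rho_1 L - \cdots - \rho_p L^p)\,\phi(L) = 1$, which gives the recursion $\phi_0 = 1$, $\phi_j = \sum_{i=1}^{\min(j,p)} \rho_i \phi_{j-i}$. The function name `ma_coefficients` is illustrative.

```python
def ma_coefficients(rhos, n_terms):
    """First n_terms coefficients phi_0, phi_1, ... of the MA representation (5.2.31).

    rhos = [rho_1, ..., rho_p] are the AR(p) coefficients of (5.2.30).
    """
    phi = [1.0]
    for j in range(1, n_terms):
        # phi_j = sum_{i=1}^{min(j, p)} rho_i * phi_{j-i}
        phi.append(sum(rhos[i - 1] * phi[j - i]
                       for i in range(1, min(j, len(rhos)) + 1)))
    return phi


print(ma_coefficients([0.5], 5))        # AR(1): phi_j = rho^j
print(ma_coefficients([1.0, -0.5], 8))  # AR(2) with complex roots
```

For AR(2) the recursion gives $\phi_1 = \rho_1 = \mu_1 + \mu_2$ and $\phi_2 = \rho_1^2 + \rho_2 = \mu_1^2 + \mu_1\mu_2 + \mu_2^2$, in agreement with (5.2.34).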


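THEOREM 5.2.2. Let $\{y_t\}$ be any sequence of random variables satisfying
(5.2.31). Then

$$
\sum_{h=0}^{\infty} |\gamma_h| < \infty, \tag{5.2.36}
$$

where $\gamma_h = Ey_t y_{t+h}$.

Proof. We have

$$
\begin{aligned}
\gamma_0 &= \sigma^2 (\phi_0^2 + \phi_1^2 + \cdots) \\
\gamma_1 &= \sigma^2 (\phi_0 \phi_1 + \phi_1 \phi_2 + \cdots) \\
\gamma_2 &= \sigma^2 (\phi_0 \phi_2 + \phi_1 \phi_3 + \cdots), \text{ and so on.}
\end{aligned}
$$

Therefore

$$
\sum_{h=0}^{\infty} |\gamma_h| \leq \sigma^2 \Bigl( \sum_{j=0}^{\infty} |\phi_j| \Bigr)^2, \tag{5.2.37}
$$

from which the theorem follows.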


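THEOREM 5.2.3. Let $\{y_t\}$ be any stationary sequence satisfying (5.2.36).
Then the characteristic roots of the $T \times T$ autocovariance matrix $\Sigma_T$ of $\{y_t\}$ are
bounded from above.


Proof. Let $x = (x_0, x_1, \ldots, x_{T-1})'$ be the characteristic vector of $\Sigma_T$
corresponding to a root $\lambda$. That is,

$$
\Sigma_T x = \lambda x. \tag{5.2.38}
$$

Suppose $|x_t| = \max [|x_0|, |x_1|, \ldots, |x_{T-1}|]$. The $(t + 1)$st equation of (5.2.38)
can be written as

$$
\gamma_t x_0 + \gamma_{t-1} x_1 + \cdots + \gamma_0 x_t + \cdots + \gamma_{T-1-t} x_{T-1} = \lambda x_t. \tag{5.2.39}
$$

Therefore, because $\lambda \geq 0$,

$$
|\gamma_t| + |\gamma_{t-1}| + \cdots + |\gamma_0| + \cdots + |\gamma_{T-1-t}| \geq \lambda. \tag{5.2.40}
$$

Therefore

$$
2 \sum_{h=0}^{T} |\gamma_h| \geq \lambda, \tag{5.2.41}
$$

from which the theorem follows.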

The premise of the next theorem is weaker than the conclusion of the 
preceding theorem. In terms of the spectral density, the premise of Theorem 
5.2.4 is equivalent to its existence, and the conclusion of Theorem 5.2.3 to its 
continuity. 


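THEOREM 5.2.4. Let $\{y_t\}$ be any sequence of random variables satisfying

$$
y_t = \sum_{j=0}^{\infty} \phi_j \epsilon_{t-j}, \qquad \sum_{j=0}^{\infty} \phi_j^2 < \infty, \tag{5.2.42}
$$

where $\{\epsilon_t\}$ are i.i.d. with $E\epsilon_t = 0$ and $E\epsilon_t^2 = \sigma^2$. Then

$$
\lim_{h \to \infty} \gamma_h = 0. \tag{5.2.43}
$$

Note that (5.2.31) implies (5.2.42).

Proof. The theorem follows from the Cauchy-Schwarz inequality,
namely,

$$
\gamma_h^2 = \sigma^4 (\phi_0 \phi_h + \phi_1 \phi_{h+1} + \cdots)^2 \tag{5.2.44}
$$

$$
\leq \sigma^4 \sum_{j=0}^{\infty} \phi_j^2 \sum_{j=h}^{\infty} \phi_j^2.
$$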

5.3 Autoregressive Models with Moving-Average Residuals


A stationary autoregressive, moving-average model is defined by

$$
\sum_{j=0}^{p} \rho_j y_{t-j} = \sum_{j=0}^{q} \beta_j \epsilon_{t-j}, \qquad \rho_0 = \beta_0 = -1, \qquad t = 0, \pm 1, \pm 2, \ldots, \tag{5.3.1}
$$

where we assume Assumptions A, B'', C, and

ASSUMPTION D. The roots of $\sum_{j=0}^{q} \beta_j z^{q-j} = 0$ lie inside the unit circle.


Such a model will be called ARMA(p, q) for short.
We can write (5.3.1) as

$$
\rho(L) y_t = \beta(L) \epsilon_t, \tag{5.3.2}
$$

where $\rho(L) = \sum_{j=0}^{p} \rho_j L^j$ and $\beta(L) = \sum_{j=0}^{q} \beta_j L^j$. Because of Assumptions B''
and C, we can express $y_t$ as an infinite moving average

$$
y_t = \rho^{-1}(L) \beta(L) \epsilon_t \equiv \phi(L) \epsilon_t, \tag{5.3.3}
$$

where $\phi(L) = \sum_{j=0}^{\infty} \phi_j L^j$. Similarly, because of Assumption D, we can express
$y_t$ as an infinite autoregressive process

$$
\psi(L) y_t \equiv \beta^{-1}(L) \rho(L) y_t = \epsilon_t, \tag{5.3.4}
$$

where $\psi(L) = \sum_{j=0}^{\infty} \psi_j L^j$.
The spectral density of ARMA(p, q) is given by

$$
f(\omega) = \sigma^2 \frac{\bigl| \sum_{j=0}^{q} \beta_j e^{-ij\omega} \bigr|^2}{\bigl| \sum_{j=0}^{p} \rho_j e^{-ij\omega} \bigr|^2}, \qquad -\pi \leq \omega \leq \pi, \tag{5.3.5}
$$

where $|z|^2 = z\bar{z}$ for a complex number $z$ with $\bar{z}$ being its complex conjugate.
Note that (5.3.5) is reduced to (5.2.15) in the special case of AR(1). We also see
from (5.3.5) that the spectral density of a moving-average model is, except for
$\sigma^2$, the inverse of the spectral density of an autoregressive model with the same
order and the same coefficients. Because the spectral density of a stationary
process approximately corresponds to the set of the characteristic roots of the
autocovariance matrix, as was noted earlier, we can show that the autocovariance
matrix of a moving-average model is approximately equal to the inverse
of the autocovariance matrix of the corresponding autoregressive model. We
shall demonstrate this for the case of MA(1).
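Consider an MA(1) model defined by

$$
y_t = \epsilon_t - \rho \epsilon_{t-1}, \tag{5.3.6}
$$

where $|\rho| < 1$ and $\{\epsilon_t\}$ are i.i.d. with $E\epsilon_t = 0$ and $V\epsilon_t = \sigma^2$. The $T \times T$
autocovariance matrix is given by

$$
\Sigma_{(1)} = \sigma^2
\begin{bmatrix}
1 + \rho^2 & -\rho & 0 & \cdots & 0 \\
-\rho & 1 + \rho^2 & -\rho & \cdots & 0 \\
0 & -\rho & \ddots & \ddots & \vdots \\
\vdots & & \ddots & 1 + \rho^2 & -\rho \\
0 & \cdots & 0 & -\rho & 1 + \rho^2
\end{bmatrix}. \tag{5.3.7}
$$

We wish to approximate the inverse of $\Sigma_{(1)}$. If we define a $T$-vector $\eta$ such that
its first element is $\epsilon_1 - \rho \epsilon_0 - (1 - \rho^2)^{1/2} \epsilon_1$ and the other $T - 1$ elements are
zeroes, we have

$$
y = R_1 \epsilon + \eta, \tag{5.3.8}
$$

where $R_1$ is given by (5.2.10). Therefore¹

$$
\Sigma_{(1)} \cong \sigma^2 R_1 R_1'. \tag{5.3.9}
$$

But we can directly verify

$$
R_1 R_1' \cong R_1' R_1. \tag{5.3.10}
$$

From (5.2.12), (5.3.9), and (5.3.10), we conclude

$$
\sigma^2 \Sigma_{(1)}^{-1} \cong R_1^{-1} (R_1')^{-1} = \sigma^{-2} \Sigma_1. \tag{5.3.11}
$$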
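
The quality of the approximation (5.3.11) can be inspected numerically by multiplying the MA(1) covariance matrix (5.3.7) by the proposed inverse $\sigma^{-4}\Sigma_1$, with $\Sigma_1$ taken from (5.2.9). The sketch below is an illustration with hypothetical helper names. It shows that the product reproduces the identity matrix in every row except the first and the last.

```python
def ma1_cov(rho, sigma2, T):
    """Sigma_(1) of (5.3.7): tridiagonal with 1 + rho^2 and -rho, times sigma^2."""
    cov = [[0.0] * T for _ in range(T)]
    for i in range(T):
        cov[i][i] = sigma2 * (1.0 + rho ** 2)
        if i + 1 < T:
            cov[i][i + 1] = cov[i + 1][i] = -sigma2 * rho
    return cov


def ar1_cov(rho, sigma2, T):
    """Sigma_1 of (5.2.9)."""
    scale = sigma2 / (1.0 - rho ** 2)
    return [[scale * rho ** abs(i - j) for j in range(T)] for i in range(T)]


def product_with_approximate_inverse(rho, sigma2, T):
    """Sigma_(1) times sigma^{-4} Sigma_1, the approximate inverse in (5.3.11)."""
    ma, ar = ma1_cov(rho, sigma2, T), ar1_cov(rho, sigma2, T)
    return [[sum(ma[i][k] * ar[k][j] for k in range(T)) / sigma2 ** 2
             for j in range(T)] for i in range(T)]


for row in product_with_approximate_inverse(0.5, 2.0, 6):
    print([round(v, 4) for v in row])
```

The discrepancy is confined to the two boundary rows and does not grow with $T$, which is the sense in which the approximation is useful for large samples.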


Whittle (1983, p. 75) has presented the exact inverse of $\Sigma_{(1)}$. The $j,k$th
element ($j, k = 0, 1, \ldots, T - 1$) of $\Sigma_{(1)}^{-1}$, denoted $\sigma_{(1)}^{jk}$, is given for $j \leq k$ by

$$
\sigma_{(1)}^{jk} = \frac{\rho^{k-j} (1 - \rho^{2(j+1)}) (1 - \rho^{2(T-k)})}{\sigma^2 (1 - \rho^2)(1 - \rho^{2(T+1)})}, \tag{5.3.12}
$$

with $\sigma_{(1)}^{jk} = \sigma_{(1)}^{kj}$ for $j > k$.

However, the exact inverse is complicated for higher-order moving-average
models, and the approximation (5.3.11) is attractive because the same inverse
relationship is also valid for higher-order processes.

Anderson (1971, p. 223) has considered the estimation of the parameters of
ARMA(p, q) without the normality assumption on $\{\epsilon_t\}$, and Box and Jenkins
(1976) have derived the maximum likelihood estimator assuming normality
in the autoregressive integrated moving-average process. This is the model in
which the sequence obtained by repeatedly first-differencing the original
series follows ARMA(p, q). The computation of MLE can be time-consuming
because the inverse of the covariance matrix of the process, even if we use the 
approximation mentioned in the preceding paragraph, is a complicated non- 
linear function of the parameters (see Harvey, 1981b, for various computa- 
tional methods). 


5.4 Asymptotic Properties of Least Squares and Maximum Likelihood 
Estimator in the Autoregressive Model 


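We shall first consider the least squares (LS) estimation of the parameters $\rho$'s
and $\sigma^2$ of an AR(p) model (5.2.30) using $T$ observations $y_1, y_2, \ldots, y_T$. In
vector notation we can write model (5.2.30) as

$$
y = Y\rho + \epsilon, \tag{5.4.1}
$$

where $y = (y_{p+1}, y_{p+2}, \ldots, y_T)'$, $\epsilon = (\epsilon_{p+1}, \epsilon_{p+2}, \ldots, \epsilon_T)'$,
$\rho = (\rho_1, \rho_2, \ldots, \rho_p)'$, and

$$
Y = \begin{bmatrix}
y_p & y_{p-1} & \cdots & y_1 \\
y_{p+1} & y_p & \cdots & y_2 \\
\vdots & \vdots & & \vdots \\
y_{T-1} & y_{T-2} & \cdots & y_{T-p}
\end{bmatrix}.
$$

The LS estimator $\hat{\rho}$ of $\rho$ is defined as $\hat{\rho} = (Y'Y)^{-1} Y'y$. In the special case of
$p = 1$, it becomes

$$
\hat{\rho} = \frac{\sum_{t=2}^{T} y_{t-1} y_t}{\sum_{t=2}^{T} y_{t-1}^2}. \tag{5.4.2}
$$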


Model (5.4.1) superficially looks like the classical regression model (1.1.4),
but it is not because the regressors $Y$ cannot be regarded as constants. This
makes it difficult to derive the mean and variance of $\hat{\rho}$ for general $p$. However,
in the case of $p = 1$, the exact distribution of $\hat{\rho}$ can be calculated by direct
integration using the method of Imhof (1961). The distribution is negatively
skewed and downward biased. Phillips (1977) derived the Edgeworth expansion
[up to the order of $O(T^{-1})$] of the distribution of $\hat{\rho}$ in the case of $p = 1$
assuming the normality of $\{\epsilon_t\}$ and compared it with the exact distribution. He
found that the approximation is satisfactory for $\rho = 0.4$ but not for $\rho = 0.8$.
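
For $p = 1$ the LS estimator is the ratio of two sums, $\hat{\rho} = \sum_{t=2}^{T} y_{t-1} y_t / \sum_{t=2}^{T} y_{t-1}^2$, as in (5.4.2). A minimal sketch follows; the function name `ls_ar1` and the toy data are illustrative assumptions.

```python
def ls_ar1(y):
    """LS estimator (5.4.2) of rho from observations y = [y_1, ..., y_T]."""
    # Python index t - 1 corresponds to y_{t-1}; both sums run over t = 2, ..., T.
    num = sum(y[t - 1] * y[t] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den


# A noiseless toy series y_t = 2 * y_{t-1}: the estimator recovers 2 exactly.
# Such a series is not stationary and is used only to check the arithmetic.
print(ls_ar1([1.0, 2.0, 4.0, 8.0]))
```

Note that the last observation $y_T$ enters the numerator but not the denominator, so the estimate is not constrained to lie in $[-1, 1]$.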

In the general AR(p) model we must rely on the asymptotic properties of the
LS estimator. For model (5.4.1) with Assumptions A, B'', and C, Anderson
(1971, p. 193) proved

$$
\sqrt{T} (\hat{\rho} - \rho) \to N(0, \sigma^2 \Sigma_p^{-1}), \tag{5.4.3}
$$

where $\Sigma_p$ is the $p \times p$ autocovariance matrix of AR(p).² We can estimate $\sigma^2$
consistently by

$$
\hat{\sigma}^2 = T^{-1} (y - Y\hat{\rho})' (y - Y\hat{\rho}). \tag{5.4.4}
$$

Because $\operatorname{plim}_{T \to \infty} T^{-1} Y'Y = \Sigma_p$, the distribution of $\hat{\rho}$ may be approximated
by $N[\rho, \sigma^2 (Y'Y)^{-1}]$. Note that this conclusion is identical to the conclusion we
derived for the classical regression model in Theorem 3.5.4. Thus the asymptotic
theory of estimation and hypothesis testing developed in Chapters 3 and
4 for the classical regression model (1.1.4) can be applied to the autoregressive
model (5.4.1). There is one difference: In testing a null hypothesis that specifies
the values of all the elements of $\rho$, $\sigma^2 \Sigma_p^{-1}$ need not be estimated because it
depends only on $\rho$ [see (5.4.16) for the case of $p = 1$].

We shall consider the simplest case of AR(1) and give a detailed proof of the 
consistency and a sketch of the proof of the asymptotic normality. These are 
based on the proof of Anderson (1971). 


From (5.4.2) we have

$$
\hat{\rho} - \rho = \frac{\sum_{t=2}^{T} y_{t-1} \epsilon_t}{\sum_{t=2}^{T} y_{t-1}^2}. \tag{5.4.5}
$$

We shall prove consistency by showing that $T^{-1}$ times the numerator converges
to 0 in probability and that $T^{-1}$ times the denominator converges to a
positive constant in probability.

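The cross product terms in $(\sum_{t=2}^{T} y_{t-1} \epsilon_t)^2$ are of the form $y_t y_{t+s} \epsilon_{t+1} \epsilon_{t+1+s}$.
But their expectation is 0 because $\epsilon_{t+1+s}$ is independent of $y_t y_{t+s} \epsilon_{t+1}$. Thus

$$
E \Bigl( \frac{1}{T} \sum_{t=2}^{T} y_{t-1} \epsilon_t \Bigr)^2 = \frac{(T - 1) \sigma^4}{T^2 (1 - \rho^2)}. \tag{5.4.6}
$$

Therefore, by Theorem 3.2.1 (Chebyshev), we have

$$
\operatorname{plim}_{T \to \infty} \frac{1}{T} \sum_{t=2}^{T} y_{t-1} \epsilon_t = 0. \tag{5.4.7}
$$

Putting $\epsilon_t = y_t - \rho y_{t-1}$ in (5.4.7), we have

$$
\operatorname{plim}_{T \to \infty} \frac{1}{T} \sum_{t=2}^{T} y_{t-1} (y_t - \rho y_{t-1}) = 0. \tag{5.4.8}
$$

By Theorem 3.3.2 (Kolmogorov LLN 2), we have

$$
\operatorname{plim}_{T \to \infty} \frac{1}{T} \sum_{t=2}^{T} (y_t - \rho y_{t-1})^2 = \sigma^2. \tag{5.4.9}
$$

By adding (5.4.9) and $2\rho$ times (5.4.8), we have

$$
\operatorname{plim}_{T \to \infty} \Bigl( \frac{1}{T} \sum_{t=2}^{T} y_t^2 - \rho^2 \frac{1}{T} \sum_{t=2}^{T} y_{t-1}^2 \Bigr) = \sigma^2. \tag{5.4.10}
$$

Therefore

$$
\operatorname{plim}_{T \to \infty} \frac{1}{T} \sum_{t=2}^{T} y_{t-1}^2 = \frac{\sigma^2}{1 - \rho^2} + \frac{1}{1 - \rho^2} \operatorname{plim}_{T \to \infty} \frac{1}{T} (y_1^2 - y_T^2). \tag{5.4.11}
$$

But the last term of (5.4.11) is 0 because of (3.2.5), the generalized Chebyshev
inequality. Therefore

$$
\operatorname{plim}_{T \to \infty} \frac{1}{T} \sum_{t=2}^{T} y_{t-1}^2 = \frac{\sigma^2}{1 - \rho^2}. \tag{5.4.12}
$$

The consistency of $\hat{\rho}$ follows from (5.4.5), (5.4.7), and (5.4.12) because of
Theorem 3.2.6.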

Next consider asymptotic normality. For this purpose we need the following
definition: A sequence $\{v_t\}$ is said to be $K$-dependent if $(v_{t_1}, v_{t_2}, \ldots, v_{t_n})$
are independent of $(v_{s_1}, v_{s_2}, \ldots, v_{s_m})$ for any set of integers satisfying

$$
t_1 < t_2 < \cdots < t_n < s_1 < s_2 < \cdots < s_m \quad \text{and} \quad t_n + K < s_1.
$$

To apply a central limit theorem to $\sum_{t=1}^{T} v_t$, split the $T$ observations into $S$
successive groups with $M + K$ observations [so that $S(M + K) = T$] and then
in each group retain the first $M$ observations and eliminate the remaining $K$
observations. If we choose $S$ and $M$ in such a way that $S \to \infty$, $M \to \infty$, and
$S/T \to 0$ as $T \to \infty$, the elimination of $K$ observations from each of the $S$
groups does not matter asymptotically.


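We can write

$$
\frac{1}{\sqrt{T}} \sum_{t=2}^{T} y_{t-1} \epsilon_t = \frac{1}{\sqrt{T}} \sum_{t=2}^{T} v_{Nt} + \Delta_{NT}, \tag{5.4.13}
$$

where $v_{Nt} = \epsilon_t \sum_{s=0}^{N} \rho^s \epsilon_{t-1-s}$ and
$\Delta_{NT} = \frac{1}{\sqrt{T}} \sum_{t=2}^{T} (\epsilon_t \sum_{s=N+1}^{\infty} \rho^s \epsilon_{t-1-s})$. But we have

$$
E\Delta_{NT}^2 = \frac{1}{T} \sum_{t=2}^{T} \sum_{\tau=2}^{T} \sum_{s=N+1}^{\infty} \sum_{w=N+1}^{\infty} \rho^{s+w} E \epsilon_t \epsilon_\tau \epsilon_{t-1-s} \epsilon_{\tau-1-w} \tag{5.4.14}
$$

$$
= \frac{\sigma^4 (T - 1)}{T} \sum_{s=N+1}^{\infty} \rho^{2s}.
$$

Therefore $\Delta_{NT}$ can be ignored for large enough $N$. (Anderson, 1971, Theorem
7.7.1, has given an exact statement of this.) We can show that for a fixed $N$,

$$
Ev_{Nt} = 0, \qquad Ev_{Nt}^2 = \sigma^4 \sum_{s=0}^{N} \rho^{2s}, \qquad \text{and} \qquad Ev_{Nt} v_{N\tau} = 0 \text{ for } t \neq \tau.
$$

Moreover, $v_{Nt}$ and $v_{N,t+N+2}$ are independent for all $t$. Therefore $\{v_{Nt}\}$ for each $N$
are $(N + 1)$-dependent and can be subjected to a central limit theorem (see
Anderson, 1971, Theorem 7.7.5). Therefore

$$
\frac{1}{\sqrt{T}} \sum_{t=2}^{T} y_{t-1} \epsilon_t \to N \Bigl( 0, \frac{\sigma^4}{1 - \rho^2} \Bigr). \tag{5.4.15}
$$

Combining (5.4.12) and (5.4.15), we have

$$
\sqrt{T} (\hat{\rho} - \rho) \to N(0, 1 - \rho^2). \tag{5.4.16}
$$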


Now we consider the MLE in AR(1) under the normality of $\{\epsilon_t\}$. The
likelihood function is given by

$$
L = (2\pi)^{-T/2} |\Sigma_1|^{-1/2} \exp [-(1/2) y' \Sigma_1^{-1} y], \tag{5.4.17}
$$

where $|\Sigma_1|$ and $\Sigma_1^{-1}$ are given in (5.2.13) and (5.2.14), respectively. Therefore

$$
\log L = -\frac{T}{2} \log (2\pi) - \frac{T}{2} \log \sigma^2 + \frac{1}{2} \log (1 - \rho^2) - \frac{1}{2\sigma^2} Q, \tag{5.4.18}
$$

where $Q = (1 + \rho^2) \sum_{t=1}^{T} y_t^2 - \rho^2 (y_1^2 + y_T^2) - 2\rho \sum_{t=2}^{T} y_{t-1} y_t$. Putting
$\partial \log L / \partial \sigma^2 = 0$, we obtain

$$
\hat{\sigma}^2 = T^{-1} Q. \tag{5.4.19}
$$

Inserting (5.4.19) into (5.4.18) yields the concentrated log likelihood function
(aside from terms not depending on the unknown parameters)

$$
\log L^* = -\frac{T}{2} \log (2\pi) - \frac{T}{2} \log Q + \frac{1}{2} \log (1 - \rho^2). \tag{5.4.20}
$$

Setting $\partial \log L^* / \partial \rho = 0$ results in a cubic equation in $\rho$ with a unique real root
in the range $[-1, 1]$. Beach and MacKinnon (1978) have reported a method of
deriving this root.

However, by setting $\partial Q / \partial \rho = 0$, we obtain a much simpler estimator:

$$
\hat{\rho}_A = \frac{\sum_{t=2}^{T} y_{t-1} y_t}{\sum_{t=2}^{T-1} y_t^2}. \tag{5.4.21}
$$

We call it the approximate MLE. Note that it is similar to the least squares
estimator $\hat{\rho}$, given in (5.4.2), for which the range of the summation in the
denominator is from $t = 2$ to $T$. If we denote the true MLE by $\hat{\rho}_M$, we can
easily show that $\sqrt{T} (\hat{\rho}_A - \rho)$ and $\sqrt{T} (\hat{\rho}_M - \rho)$ have the same limit distribution
by using the result of Section 4.2.5.
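
The quantities in this derivation are straightforward to compute. The sketch below evaluates $Q = (1 + \rho^2)\sum_{t=1}^{T} y_t^2 - \rho^2 (y_1^2 + y_T^2) - 2\rho \sum_{t=2}^{T} y_{t-1} y_t$, the concentrated log likelihood (5.4.20), and the approximate MLE (5.4.21), whose denominator is $\sum_{t=2}^{T-1} y_t^2$. The function names and the short data vector are illustrative assumptions; the exact MLE (the root of the cubic) is not computed here.

```python
import math


def q_statistic(rho, y):
    """Q appearing in (5.4.18) for observations y = [y_1, ..., y_T]."""
    total = sum(v * v for v in y)
    cross = sum(y[t - 1] * y[t] for t in range(1, len(y)))
    return ((1.0 + rho ** 2) * total
            - rho ** 2 * (y[0] ** 2 + y[-1] ** 2)
            - 2.0 * rho * cross)


def concentrated_loglik(rho, y):
    """log L* of (5.4.20); defined for |rho| < 1."""
    T = len(y)
    return (-0.5 * T * math.log(2.0 * math.pi)
            - 0.5 * T * math.log(q_statistic(rho, y))
            + 0.5 * math.log(1.0 - rho ** 2))


def approximate_mle_ar1(y):
    """rho_A of (5.4.21): minimizes Q; the denominator omits y_1 and y_T."""
    cross = sum(y[t - 1] * y[t] for t in range(1, len(y)))
    inner = sum(v * v for v in y[1:-1])
    return cross / inner


y = [1.0, 0.5, -0.2, 0.4, 0.1]  # hypothetical data
rho_a = approximate_mle_ar1(y)
print(rho_a, q_statistic(rho_a, y), concentrated_loglik(rho_a, y))
```

Because $Q$ is a quadratic in $\rho$, the approximate MLE has a closed form, whereas the exact MLE must also account for the $\frac{1}{2}\log(1 - \rho^2)$ term.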


5.5 Prediction 


In the classical regression model with nonstochastic regressors, the problem of
predicting a future value of the dependent variable and the problem of
estimating the regression parameters can be solved in a unified way. For example,
the least squares predictor is the best linear unbiased predictor, as shown in
Section 1.6. But in time series models such as AR or MA, in which the
regressors are stochastic, the problem of optimal prediction is not a simple
corollary of the problem of optimal estimation because (1.6.3) no longer
holds. In view of this difficulty, what we usually do in practice is to obtain the
optimal predictor on the assumption that the parameters are known and then
insert the estimates obtained by the best known method, such as maximum
likelihood, mechanically into the formula for the optimal predictor.

There is an extensive literature on the theory concerning the optimal
prediction of a stationary time series when the parameters of the process are
assumed known (see Whittle, 1983, and the references in it).

In the simplest case of AR(1), the optimal predictor of $y_{t+n}$ given
$y_t, y_{t-1}, \ldots$ is obtained as follows: From (5.2.2) we obtain

$$
y_{t+n} = \rho^n y_t + \sum_{j=0}^{n-1} \rho^j \epsilon_{t+n-j}. \tag{5.5.1}
$$

Taking the conditional expectation of both sides given $y_t, y_{t-1}, \ldots$, we
obtain

$$
\hat{y}_{t+n} \equiv E(y_{t+n} \mid y_t, y_{t-1}, \ldots) = \rho^n y_t. \tag{5.5.2}
$$

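In the mixed autoregressive, moving-average model (5.3.1), the optimal
predictor of $y_{t+n}$ given $y_t, y_{t-1}, \ldots$ is obtained as follows: From (5.3.3) we
obtain

$$
y_{t+n} = \sum_{k=0}^{\infty} \phi_k \epsilon_{t+n-k}. \tag{5.5.3}
$$

Putting $\epsilon_{t+n} = \epsilon_{t+n-1} = \cdots = \epsilon_{t+1} = 0$ in (5.5.3), we obtain the optimal
predictor

$$
\hat{y}_{t+n} = \sum_{j=0}^{\infty} \phi_{j+n} \epsilon_{t-j}. \tag{5.5.4}
$$

Finally, putting $\epsilon_{t-j} = \phi^{-1}(L) y_{t-j}$ in (5.5.4), we obtain

$$
\hat{y}_{t+n} = \phi^{-1}(L) \sum_{j=0}^{\infty} \phi_{j+n} L^j y_t. \tag{5.5.5}
$$

(See Whittle, 1983, p. 32.)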


5.6 Distributed-Lag Models 

5.6.1 The General Lag 

If we add an exogenous variable to the right-hand side of Eq. (5.3.2), we obtain

$$
\rho(L) y_t = \alpha x_t + \beta(L) \epsilon_t. \tag{5.6.1}
$$

Such a model is called a distributed-lag model by econometricians because
solving (5.6.1) for $y_t$ yields

$$
y_t = \alpha \rho^{-1}(L) x_t + \rho^{-1}(L) \beta(L) \epsilon_t, \tag{5.6.2}
$$

where $\rho^{-1}(L)$ describes a distribution of the effects of the lagged values of $x_t$
(that is, $x_t, x_{t-1}, x_{t-2}, \ldots$) on $y_t$. There is a vast amount of econometric
literature on distributed-lag models. The interested reader is referred to the
aforementioned works by Fuller (1976); Nerlove, Grether, and Carvalho
(1979); and Dhrymes (1971). A model of the form (5.6.2) also arises as a
solution of the rational expectations model (see Shiller, 1978; Wallis, 1980).
We shall discuss briefly two of the most common types of distributed-lag
models: the geometric lag and the Almon lag.


5.6.2 The Geometric Lag 
The geometric lag model is defined by

$$
y_t = \alpha \sum_{j=0}^{\infty} \lambda^j x_{t-j} + v_t. \tag{5.6.3}
$$

This model was first used in econometric applications by Koyck (1954)
(hence, it is sometimes referred to as the Koyck lag) and by Nerlove (1958).
Griliches's survey article (1967) contains a discussion of this model and other
types of distributed-lag models. This model has the desirable property of
having only two parameters, but it cannot deal with more general types of lag
distribution where the effect of $x$ on $y$ attains its peak after a certain number of
lags and then diminishes.
By inverting the lag operator $\sum_{j=0}^{\infty} \lambda^j L^j$, we can write (5.6.3) equivalently as

$$
y_t = \lambda y_{t-1} + \alpha x_t + u_t, \tag{5.6.4}
$$

where $u_t = v_t - \lambda v_{t-1}$. If $\{u_t\}$ are i.i.d., the estimation of $\lambda$ and $\alpha$ is not much
more difficult than the estimation in the purely autoregressive model discussed
in Section 5.4. Under general conditions the least squares estimator
can be shown to be consistent and asymptotically normal (see Crowder, 1980).
For a discussion of the estimation of $\lambda$ and $\alpha$ when $\{v_t\}$, rather than $\{u_t\}$, are
i.i.d., see Amemiya and Fuller (1967) and Hatanaka (1974) (also see Section
6.3.7 for further discussion of this model).
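
The equivalence between the geometric lag form $y_t = \alpha \sum_j \lambda^j x_{t-j} + v_t$ and the transformed form $y_t = \lambda y_{t-1} + \alpha x_t + u_t$ with $u_t = v_t - \lambda v_{t-1}$ can be checked numerically. The sketch below assumes, for illustration only, that $x_s = 0$ before the first observation so that the infinite sum is finite; the function names and the data are hypothetical.

```python
def geometric_lag(alpha, lam, x, v):
    """y_t = alpha * sum_{j=0}^{t} lam^j x_{t-j} + v_t, assuming x_s = 0 for s < 0."""
    return [alpha * sum(lam ** j * x[t - j] for j in range(t + 1)) + v[t]
            for t in range(len(x))]


def koyck_residuals(alpha, lam, x, y):
    """u_t = y_t - lam * y_{t-1} - alpha * x_t for t = 1, ..., len(y) - 1."""
    return [y[t] - lam * y[t - 1] - alpha * x[t] for t in range(1, len(y))]


x = [1.0, 0.0, 2.0, -1.0, 0.5]   # hypothetical exogenous series
v = [0.1, -0.2, 0.0, 0.3, -0.1]  # hypothetical errors
y = geometric_lag(alpha=2.0, lam=0.6, x=x, v=v)
u = koyck_residuals(alpha=2.0, lam=0.6, x=x, y=y)
# u_t should equal v_t - lam * v_{t-1}.
print(u)
print([v[t] - 0.6 * v[t - 1] for t in range(1, len(v))])
```

The transformation removes the infinite lag but turns an i.i.d. $v_t$ into a moving-average error $u_t$ that is correlated with the regressor $y_{t-1}$, which is why the two error assumptions lead to different estimation problems.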


5.6.3 The Almon Lag 
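Almon (1965) proposed a distributed-lag model

$$
y_t = \sum_{j=1}^{N} \beta_j x_{t-j} + u_t \tag{5.6.5}
$$

in which $\beta_1, \beta_2, \ldots, \beta_N$ lie on the curve of a $q$th-order polynomial; that is,

$$
\beta_j = \delta_0 + \delta_1 j + \delta_2 j^2 + \cdots + \delta_q j^q, \qquad j = 1, 2, \ldots, N. \tag{5.6.6}
$$

By defining vectors $\beta = (\beta_1, \beta_2, \ldots, \beta_N)'$ and $\delta = (\delta_0, \delta_1, \ldots, \delta_q)'$, we
can write (5.6.6) in vector notation as

$$
\beta = J\delta, \tag{5.6.7}
$$

where

$$
J = \begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & 2 & 2^2 & \cdots & 2^q \\
\vdots & \vdots & \vdots & & \vdots \\
1 & N & N^2 & \cdots & N^q
\end{bmatrix}.
$$

The estimation of $\delta$ can be done by the least squares method. Let $X$ be a
$T \times N$ matrix, the $t,j$th element of which is $x_{t-j}$. Then
$\hat{\delta} = (J'X'XJ)^{-1} J'X'y$ and $\hat{\beta} = J\hat{\delta}$. Note that $\hat{\beta}$ is a special case of the
constrained least squares estimator (1.4.11) where $R = J$ and $c = 0$.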
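
The polynomial restriction $\beta = J\delta$ is easy to illustrate. The sketch below builds the $N \times (q + 1)$ matrix $J$ whose $j$th row is $(1, j, j^2, \ldots, j^q)$ and maps a polynomial coefficient vector $\delta$ into lag coefficients $\beta$. The function names and the chosen $\delta$ are hypothetical; the least squares step itself is not shown.

```python
def almon_matrix(N, q):
    """J of (5.6.7): row j (j = 1, ..., N) is [1, j, j^2, ..., j^q]."""
    return [[float(j ** k) for k in range(q + 1)] for j in range(1, N + 1)]


def lag_coefficients(J, delta):
    """beta = J delta, i.e. beta_j = delta_0 + delta_1 j + ... + delta_q j^q."""
    return [sum(row[k] * delta[k] for k in range(len(delta))) for row in J]


J = almon_matrix(N=4, q=2)
# Hypothetical quadratic beta_j = 5 j - j^2, which rises and then falls.
print(lag_coefficients(J, [0.0, 5.0, -1.0]))
```

The example shows the kind of hump-shaped lag distribution that a two-parameter geometric lag cannot produce. In an application one would regress $y$ on the transformed regressors $XJ$ to obtain $\hat{\delta}$ and then form $\hat{\beta} = J\hat{\delta}$.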

By choosing $N$ and $q$ judiciously, a researcher can hope to attain both a
reasonably flexible distribution of lags and parsimony in the number of pa- 
rameters to estimate. Amemiya and Morimune (1974) showed that a small 
order of polynomials (q = 2 or 3) works surprisingly well for many economic 
time series. 

Some researchers prefer to constrain the value of the polynomial to be 0 at
$j = N + 1$. This amounts to imposing another equation

$$
\delta_0 + \delta_1 (N + 1) + \delta_2 (N + 1)^2 + \cdots + \delta_q (N + 1)^q = 0 \tag{5.6.8}
$$

in addition to (5.6.6). Solving (5.6.8) for $\delta_0$ and inserting it into the right-hand
side of (5.6.7) yields the vector equation

$$
\beta = J^* \delta^*, \tag{5.6.9}
$$

where $\delta^* = (\delta_1, \delta_2, \ldots, \delta_q)'$ and $J^*$ should be appropriately defined.


Exercises 


1. (Section 5.2.1)
Prove that model (5.2.1) with Assumptions A, B, and C is equivalent to
model (5.2.3) with Assumptions A and B.

2. (Section 5.2.1)
Show that the process defined in the paragraph following Eq. (5.2.5) is
AR(1).

3. (Section 5.3)
Find the exact inverse of the variance-covariance matrix of MA(1) using
(5.3.12) and compare it with the variance-covariance matrix of AR(1).

4. (Section 5.3)
In the MA(1) process defined by (5.3.6), define $y_t^* = \epsilon_t^* - \rho^{-1} \epsilon_{t-1}^*$, where
$\{\epsilon_t^*\}$ are i.i.d. with $E\epsilon_t^* = 0$, $V\epsilon_t^* = \rho^2 \sigma^2$. Show that the autocovariances of
$y_t$ and $y_t^*$ are the same.

5. (Section 5.4)
If $X$, $Y$, and $Z$ are jointly normal and if $X$ is independent of either $Y$ or $Z$,
then $EXYZ = EXEYZ$ (Anderson, 1958, p. 22). Show by a counterexample
that the equality does not in general hold without the normality
assumption.

6. (Section 5.4)
Show that $\sqrt{T} (\hat{\rho}_A - \rho)$ and $\sqrt{T} (\hat{\rho}_M - \rho)$ have the same limit distribution.

7. (Section 5.4)
In the AR(1) process defined by (5.2.1), define the first differences
$y_t^* = y_t - y_{t-1}$ and derive $\operatorname{plim}_{T \to \infty} \sum_{t=3}^{T} y_{t-1}^* y_t^* / \sum_{t=3}^{T} y_{t-1}^{*2}$.

8. (Section 5.4)
In the AR(2) process defined by (5.2.16), derive
$\operatorname{plim}_{T \to \infty} \sum_{t=2}^{T} y_{t-1} y_t / \sum_{t=2}^{T} y_{t-1}^2$.

9. (Section 5.5)
Derive (5.5.2) from the general formula of (5.5.5).

10. (Section 5.5)
In the MA(1) process defined by (5.3.6), obtain the optimal predictor of
$y_{t+n}$ given $y_t, y_{t-1}, \ldots$.

11. (Section 5.6)
Show that (5.6.7) can be written in the equivalent form $Q'\beta = 0$, where $Q$
is an $N \times (N - q - 1)$ matrix such that $[Q, J]$ is nonsingular and
$Q'J = 0$. Find such a $Q$ when $N = 4$ and $q = 2$.


6 Generalized Least Squares Theory 


One of the important assumptions of the standard regression model (Model 1) 
is the assumption that the covariance matrix of the error terms is a scalar times 
the identity matrix. In this chapter we shall relax this assumption. In Section 
6.1 the case in which the covariance matrix is known will be considered. It is 
pedagogically useful to consider this case before we consider, in Section 6.2, a 
more realistic case of an unknown covariance matrix. Then in Sections 6.3 
through 6.7 various ways of specifying the covariance matrix will be con- 
sidered. 


6.1 The Case of a Known Covariance Matrix 
6.1.1 Model 6


The linear regression model we shall consider in this chapter, called Model 6, 
is defined by 


$$
y = X\beta + u, \tag{6.1.1}
$$

where $X$ is a $T \times K$ matrix of known constants with rank $K$ $(\leq T)$, $\beta$ is a
$K$-vector of unknown parameters, and $u$ is a $T$-vector of random variables with
$Eu = 0$ and $Euu' = \Sigma$, a known $T \times T$ positive-definite covariance matrix. (In
Section 6.1.5 we shall consider briefly the case of a singular covariance
matrix.) We shall write the $t,s$th element of $\Sigma$ as $\sigma_{ts}$.


6.1.2 Generalized Least Squares Estimator 


Because $\Sigma$ is positive definite, we can define $\Sigma^{-1/2}$ as $HD^{-1/2}H'$, where $H$ is an
orthogonal matrix consisting of the characteristic vectors of $\Sigma$, $D$ is the diagonal
matrix consisting of the characteristic roots of $\Sigma$, and $D^{-1/2}$ is obtained by
taking the $(-\frac{1}{2})$th power of every diagonal element of $D$ (see Theorem 3 of
Appendix 1). Premultiplying (6.1.1) by $\Sigma^{-1/2}$ yields

$$
y^* = X^* \beta + u^*, \tag{6.1.2}
$$

where $y^* = \Sigma^{-1/2} y$, $X^* = \Sigma^{-1/2} X$, and $u^* = \Sigma^{-1/2} u$. Note that $Eu^* = 0$ and
$Eu^* u^{*\prime} = I$, so (6.1.2) is Model 1 except that we do not assume the elements
of $u^*$ are i.i.d. here. The generalized least squares (GLS) estimator of $\beta$ in
model (6.1.1), denoted $\hat{\beta}_G$, is defined as the least squares estimator of $\beta$ in
model (6.1.2); namely,

$$
\hat{\beta}_G = (X^{*\prime} X^*)^{-1} X^{*\prime} y^* \tag{6.1.3}
$$

$$
= (X' \Sigma^{-1} X)^{-1} X' \Sigma^{-1} y.
$$

Using the results of Section 1.2, we obtain that $E\hat{\beta}_G = \beta$ and

$$
V\hat{\beta}_G = (X' \Sigma^{-1} X)^{-1}. \tag{6.1.4}
$$


Furthermore, GLS is the best linear unbiased estimator of Model 6. (Note that 
in Section 1.2.5 we did not require the independence of the error terms to 
prove that LS is BLUE.) 
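
As a small illustration of the GLS formulas $\hat{\beta}_G = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y$ and $V\hat{\beta}_G = (X'\Sigma^{-1}X)^{-1}$, the sketch below treats the simplest special case: a single regressor ($K = 1$) and a diagonal $\Sigma$, so that both expressions reduce to weighted sums. The restriction to a diagonal $\Sigma$, the function name, and the data are assumptions made to keep the example free of matrix routines.

```python
def gls_single_regressor(x, y, variances):
    """GLS estimate and its variance when K = 1 and Sigma = diag(variances)."""
    # x' Sigma^{-1} y and x' Sigma^{-1} x are sums weighted by 1 / sigma_tt.
    num = sum(xi * yi / s for xi, yi, s in zip(x, y, variances))
    den = sum(xi * xi / s for xi, s in zip(x, variances))
    return num / den, 1.0 / den


x = [1.0, 2.0, 3.0]  # hypothetical regressor
y = [2.0, 4.0, 7.0]  # hypothetical dependent variable
print(gls_single_regressor(x, y, [1.0, 1.0, 1.0]))  # equal variances: same as LS
print(gls_single_regressor(x, y, [1.0, 1.0, 4.0]))  # noisier third observation
```

When the third observation has a larger error variance, it receives less weight, and the estimate moves toward the value implied by the first two observations.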


6.1.3 Efficiency of Least Squares Estimator 


It is easy to show that in Model 6 the least squares estimator $\hat{\beta}$ is unbiased with
its covariance matrix given by

$$
V\hat{\beta} = (X'X)^{-1} X' \Sigma X (X'X)^{-1}. \tag{6.1.5}
$$

Because GLS is BLUE, it follows that

$$
(X'X)^{-1} X' \Sigma X (X'X)^{-1} \geq (X' \Sigma^{-1} X)^{-1}, \tag{6.1.6}
$$

which can also be directly proved. There are cases where the equality holds in
(6.1.6), as shown in the following theorem.


THEOREM 6.1.1. Let $X'X$ and $\Sigma$ both be positive definite. Then the following
statements are equivalent.

(A) $(X'X)^{-1} X' \Sigma X (X'X)^{-1} = (X' \Sigma^{-1} X)^{-1}$.

(B) $\Sigma X = XB$ for some nonsingular $B$.

(C) $(X'X)^{-1} X' = (X' \Sigma^{-1} X)^{-1} X' \Sigma^{-1}$.

(D) $X = HA$ for some nonsingular $A$ where the columns of $H$ are $K$
characteristic vectors of $\Sigma$.

(E) $X' \Sigma Z = 0$ for any $Z$ such that $Z'X = 0$.

(F) $\Sigma = X \Gamma X' + Z \Theta Z' + \sigma^2 I$ for some $\Gamma$ and $\Theta$ and $Z$ such that $Z'X = 0$.


Proof. We show that statement A $\Rightarrow$ statement B $\Rightarrow$ statement C:

Statement A $\Rightarrow X' \Sigma X = X'X (X' \Sigma^{-1} X)^{-1} X'X$

$\Rightarrow X' \Sigma^{1/2} [I - \Sigma^{-1/2} X (X' \Sigma^{-1} X)^{-1} X' \Sigma^{-1/2}] \Sigma^{1/2} X = 0$

$\Rightarrow \Sigma^{1/2} X = \Sigma^{-1/2} XB$ for some $B$ using Theorem 14 of Appendix 1

$\Rightarrow \Sigma X = XB$ for some $B$

$\Rightarrow (X'X)^{-1} X' \Sigma X = B$

$\Rightarrow B$ is nonsingular because $X' \Sigma X$ is nonsingular

$\Rightarrow$ statement B

$\Rightarrow X' \Sigma^{-1} X = (B')^{-1} X'X$ and $X' \Sigma^{-1} = (B')^{-1} X'$

$\Rightarrow$ statement C.

Statement C $\Rightarrow$ statement D can be easily proved using Theorem 16 of
Appendix 1. (Anderson, 1971, p. 561, has given the proof.) Statement D $\Rightarrow$
statement A and statement B $\Rightarrow$ statement E are straightforward. To prove
statement E $\Rightarrow$ statement B, note

statement E $\Rightarrow X' \Sigma (Z, X) = (0, X' \Sigma X)$

$\Rightarrow X' \Sigma = (0, X' \Sigma X) [Z (Z'Z)^{-1}, X (X'X)^{-1}]'$
$= X' \Sigma X (X'X)^{-1} X'$ using Theorem 15 of Appendix 1

$\Rightarrow$ statement B because $X' \Sigma X (X'X)^{-1}$ is nonsingular.


For a proof of the equivalence of statement F and the other five statements, see
Rao (1965).


There are situations in which LS is equal to GLS. For example, consider 

$$
y = X(\beta + v) + u, \tag{6.1.7}
$$

where $\beta$ is a vector of unknown parameters and $u$ and $v$ are random variables 
with $Eu = 0$, $Ev = 0$, $Euu' = \sigma^2 I$, $Evv' = \Gamma$, and $Euv' = 0$. In this model, 
statement F of Theorem 6.1.1 is satisfied because $E(Xv + u)(Xv + u)' = 
X\Gamma X' + \sigma^2 I$. Therefore LS = GLS in the estimation of $\beta$. 
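
As a numerical illustration of this example and of inequality (6.1.6), the sketch below (assuming NumPy; the dimensions and the covariance matrices are hypothetical values chosen for illustration) builds a covariance matrix of the form in statement F with the $Z$ term absent and confirms that LS and GLS coincide. It then draws an unrestricted positive definite matrix and checks that the difference between the two covariance matrices is nonnegative definite.

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 30, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = rng.normal(size=T)                       # any y will do: the identity is algebraic

C = rng.normal(size=(K, K))
Gamma = C @ C.T                              # covariance matrix of v (hypothetical)
sigma2 = 0.5
Sigma = X @ Gamma @ X.T + sigma2 * np.eye(T) # E(Xv + u)(Xv + u)'

XtX_inv = np.linalg.inv(X.T @ X)
beta_ls = XtX_inv @ X.T @ y
Si = np.linalg.inv(Sigma)
V_gls = np.linalg.inv(X.T @ Si @ X)          # (6.1.4)
beta_gls = V_gls @ X.T @ Si @ y              # (6.1.3)
V_ls = XtX_inv @ X.T @ Sigma @ X @ XtX_inv   # (6.1.5)

# For a generic Sigma, V_ls - V_gls is nonnegative definite but not zero (6.1.6).
G = rng.normal(size=(T, T))
Sigma2 = G @ G.T + np.eye(T)
V_ls2 = XtX_inv @ X.T @ Sigma2 @ X @ XtX_inv
V_gls2 = np.linalg.inv(X.T @ np.linalg.inv(Sigma2) @ X)
min_eig = np.linalg.eigvalsh(V_ls2 - V_gls2).min()
```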

There are situations in which the conditions of Theorem 6.1.1 are asymptotically 
satisfied so that LS and GLS have the same asymptotic distribution. 
Anderson (1971, p. 581) has presented two such examples: 

$$
y_t = \beta_1 + \beta_2 t + \beta_3 t^2 + \cdots + \beta_K t^{K-1} + u_t \tag{6.1.8}
$$

and 

$$
y_t = \beta_1\cos\lambda_1 t + \beta_2\cos\lambda_2 t + \cdots + \beta_K\cos\lambda_K t + u_t, \tag{6.1.9}
$$

where in each case $\{u_t\}$ follow a general stationary process defined by (5.2.42). 
[That is, take the $y_t$ of (5.2.42) as the present $u_t$.] 

We shall verify that condition B of Theorem 6.1.1 is approximately satisfied 
for the polynomial regression (6.1.8) with $K = 3$ when $\{u_t\}$ follow the stationary 
first-order autoregressive model (5.2.1). Define $X = (x_1, x_2, x_3)$, where the 
$t$th elements of the vectors $x_1$, $x_2$, and $x_3$ are $x_{1t} = 1$, $x_{2t} = t$, and $x_{3t} = t^2$, 
respectively. Then it is easy to show $\Sigma^{-1}X \cong XA$, or equivalently $\Sigma X \cong XA^{-1}$, 
where $\Sigma$ is given in (5.2.9), its inverse in (5.2.14), and 

$$
A = \frac{1}{\sigma^2}
\begin{bmatrix}
(1-\rho)^2 & 0 & -2\rho \\
0 & (1-\rho)^2 & 0 \\
0 & 0 & (1-\rho)^2
\end{bmatrix}.
\tag{6.1.10}
$$

The approximate equality $\Sigma^{-1}X \cong XA$ is exact except for the first and the last rows. 

We have seen in the preceding discussion that in Model 6 LS is generally not 
efficient. Use of LS in Model 6 has another possible drawback: The covariance 
matrix given in (6.1.5) may not be estimated consistently using the usual 
formula $\hat\sigma^2(X'X)^{-1}$. Under appropriate assumptions we have plim $\hat\sigma^2$ = 
lim $T^{-1}$ tr $M\Sigma$, where $M = I - X(X'X)^{-1}X'$. Therefore plim $\hat\sigma^2(X'X)^{-1}$ is 
generally different from (6.1.5). Furthermore, we cannot unequivocally 
determine the direction of the bias. Consequently, the standard $t$ and $F$ tests of 
linear hypotheses developed in Chapter 1 are no longer valid. 


6.1.4 Consistency of the Least Squares Estimator 


We shall obtain a useful set of conditions on $X$ and $\Sigma$ for the consistency of LS 
in Model 6. We shall use the following lemma in matrix analysis. 


LEMMA 6.1.1. Let $A$ and $B$ be nonnegative definite matrices of size $n$. Then 
tr $(AB) \leq \lambda_l(A)$ tr $B$, where $\lambda_l(A)$ denotes the largest characteristic root of $A$. 


Proof. Let $H$ be a matrix such that $H'AH = D$, diagonal, and $H'H = I$. 
Then, tr $(AB)$ = tr $(H'AHH'BH)$ = tr $DQ$, where $Q = H'BH$. Let $d_i$ be the 
$i$th diagonal element of $D$ and $q_{ii}$ be of $Q$. Because $Q$ is nonnegative definite, 
$q_{ii} \geq 0$. Then 

$$
\operatorname{tr}(DQ) = \sum_{i=1}^n d_i q_{ii} \leq \max_i d_i \cdot \sum_{i=1}^n q_{ii} = \lambda_l(A)\operatorname{tr} B.
$$
Now we can prove the following theorem. 


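THEOREM 6.1.2. In Model 6 assume 
(A) $\lambda_l(\Sigma)$ bounded for all $T$, 
(B) $\lambda_s(X'X) \to \infty$, where $\lambda_s$ denotes the smallest characteristic root. 

Then $\operatorname{plim}_{T\to\infty}\hat\beta = \beta$. 

Proof. We have 

$$
\begin{aligned}
\operatorname{tr} V\hat\beta &= \operatorname{tr}[(X'X)^{-1}X'\Sigma X(X'X)^{-1}] \\
&= \operatorname{tr}[\Sigma X(X'X)^{-2}X'] \\
&\leq \lambda_l(\Sigma)\operatorname{tr}[X(X'X)^{-2}X'] \quad\text{by Lemma 6.1.1} \\
&= \lambda_l(\Sigma)\operatorname{tr}(X'X)^{-1}.
\end{aligned}
$$

But the last term converges to 0 because of assumptions A and B, since 
tr $(X'X)^{-1} \leq K/\lambda_s(X'X)$. 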
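
The chain of inequalities in this proof can be illustrated numerically. The sketch below (assuming NumPy) uses AR(1) errors and a constant plus a scaled trend as regressors; both choices, and the parameter values, are hypothetical and serve only to show the bound at work. For each sample size it evaluates the trace of the LS covariance matrix and the bound from Lemma 6.1.1.

```python
import numpy as np

def ar1_cov(T, rho, sigma2=1.0):
    """Covariance matrix of a stationary AR(1) process with innovation variance sigma2."""
    idx = np.arange(T)
    return sigma2 / (1.0 - rho**2) * rho ** np.abs(idx[:, None] - idx[None, :])

def trace_var_ls(X, Sigma):
    """tr[(X'X)^{-1} X' Sigma X (X'X)^{-1}], the trace of (6.1.5)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.trace(XtX_inv @ X.T @ Sigma @ X @ XtX_inv)

rho = 0.7
results = []
for T in (25, 100, 400):
    t = np.arange(1, T + 1)
    X = np.column_stack([np.ones(T), t / T])      # hypothetical regressors
    Sigma = ar1_cov(T, rho)
    lam_max = np.linalg.eigvalsh(Sigma)[-1]       # largest characteristic root
    bound = lam_max * np.trace(np.linalg.inv(X.T @ X))
    results.append((T, trace_var_ls(X, Sigma), bound, lam_max))
```

In this design the largest root of the error covariance matrix stays below $\sigma^2/(1-\rho)^2$ for every sample size, which is the kind of boundedness required by assumption A.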


Note that Theorem 3.5.1 is a special case of Theorem 6.1.2. One interesting 
implication of Theorems 6.1.2 and 5.2.3 is that LS is consistent if u follows a 
stationary time series satisfying (5.2.36). 

We have not proved the asymptotic normality of LS or GLS in this section 
because the proof would require additional specific assumptions about the 
generation of u. 


6.1.5 A Singular Covariance Matrix 


If the covariance matrix $\Sigma$ is singular, we obviously cannot define GLS by 
(6.1.3). Suppose that the rank of $\Sigma$ is $S < T$. Then, by Theorem 3 of Appendix 
1, we can find an orthogonal matrix $H = (H_1, H_2)$, where $H_1$ is $T \times S$ and $H_2$ 
is $T \times (T - S)$, such that $H_1'\Sigma H_1 = \Lambda$, a diagonal matrix consisting of the $S$ 
positive characteristic roots of $\Sigma$, $H_1'\Sigma H_2 = 0$, and $H_2'\Sigma H_2 = 0$. The 
premultiplication of (6.1.1) by $H'$ yields two vector equations: 

$$
H_1'y = H_1'X\beta + H_1'u \tag{6.1.11}
$$

and 

$$
H_2'y = H_2'X\beta. \tag{6.1.12}
$$

Note that there is no error term in (6.1.12) because $EH_2'uu'H_2 = H_2'\Sigma H_2 = 0$ 
and therefore $H_2'u$ is identically equal to a zero vector. Then the best linear 
unbiased estimator of $\beta$ is GLS applied to (6.1.11) subject to linear constraints 
(6.1.12).¹ Or, equivalently, it is LS applied to 

$$
\Lambda^{-1/2}H_1'y = \Lambda^{-1/2}H_1'X\beta + \Lambda^{-1/2}H_1'u \tag{6.1.13}
$$

subject to the same constraints. Thus it can be calculated by appropriately 
redefining the symbols $X$, $Q$, and $c$ in formula (1.4.5). 


6.2 The Case of an Unknown Covariance Matrix 


In the remainder of this chapter, we shall consider Model 6 assuming that $\Sigma$ is 
unknown and therefore must be estimated. Suppose we somehow obtain an 
estimator $\hat\Sigma$. Then we define the feasible generalized least squares (FGLS) 
estimator by 

$$
\hat\beta_F = (X'\hat\Sigma^{-1}X)^{-1}X'\hat\Sigma^{-1}y, \tag{6.2.1}
$$

assuming $\hat\Sigma$ is nonsingular. 

For $\hat\beta_F$ to be a reasonably good estimator, we should at least require it to be 
consistent. This means that the number of the free parameters that characterize 
$\Sigma$ should be either bounded or allowed to go to infinity at a slower rate than 
$T$. Thus one must impose particular structure on $\Sigma$, specifying how it depends 
on a set of free parameters that are fewer than $T$ in number. In this section we 
shall consider five types of models in succession. For each we shall impose a 
particular structure on $\Sigma$ and then study the properties of LS and FGLS and 
other estimators of $\beta$. We shall also discuss the estimation of $\Sigma$. The five 
models we shall consider are (1) serial correlation, (2) seemingly unrelated 
regression models, (3) heteroscedasticity, (4) error components models, and 
(5) random coefficients models. 

In each of the models mentioned in the preceding paragraph, $\hat\Sigma$ is obtained 
from the least squares residuals $\hat u \equiv y - X\hat\beta$, where $\hat\beta$ is the LS estimator. 
Under general conditions we shall show that $\hat\beta$ is consistent in these models 
and hence that $\hat\Sigma$ is consistent. Using this result, we shall show that $\hat\beta_F$ has the 
same asymptotic distribution as $\hat\beta_G$. 

In some situations we may wish to use (6.2.1) as an iterative procedure; that 
is, given $\hat\beta_F$ we can calculate the new residuals $y - X\hat\beta_F$, reestimate $\Sigma$, and 
insert it into the right-hand side of (6.2.1). The asymptotic distribution is 
unchanged by iterating, but in certain cases (for example, if $y$ is normal) 
iterating will allow convergence to the maximum likelihood estimator (see 
Oberhofer and Kmenta, 1974). 


6.3 Serial Correlation 


In this section we shall consider mainly Model 6 where $u$ follows AR(1) 
defined in (5.2.1). Most of our results hold also for more general stationary 
processes, as will be indicated. The covariance matrix of $u$, denoted $\Sigma_1$, is as 
given in (5.2.9), and its inverse is given in (5.2.14). 


6.3.1 Asymptotic Normality of the Least Squares Estimator 


As was noted earlier, the first step in obtaining FGLS is calculating LS. 
Therefore the properties of FGLS depend on the properties of LS. The least squares 
estimator is consistent in this model if assumption B of Theorem 6.1.2 is 
satisfied because in this case assumption A is satisfied because of Theorem 
5.2.3. In fact, Theorem 5.2.3 states that assumption A is satisfied even when $u$ 
follows a more general process than AR(1). So in this subsection we shall prove 
only the asymptotic normality of LS, and we shall do so for a process of $u$ more 
general than AR(1) but only for the case of one regressor (that is, $K = 1$) in 
Model 6. (Anderson, 1971, p. 585, has given the proof in the case of $K$ 
regressors. He assumed an even more general process for $u$ than the one we 
assume in the following proof.) 


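THEOREM 6.3.1. Assume $K = 1$ in Model 6. Because $X$ is a $T$-vector in this 
case, we shall denote it by $x$ and its $t$th element by $x_t$. Assume 

(A) $\lim_{T\to\infty} T^{-1}x'x = c_1 \neq 0$. 

(B) $u_t = \sum_{j=0}^{\infty}\phi_j\epsilon_{t-j}$, $\sum_{j=0}^{\infty}|\phi_j| < \infty$, where $\{\epsilon_t\}$ are i.i.d. with $E\epsilon_t = 0$ and 
$E\epsilon_t^2 = \sigma^2$. 

Then $\sqrt{T}(\hat\beta - \beta) \to N(0, c_1^{-2}c_2)$, where $c_2 = \lim_{T\to\infty} T^{-1}x'Euu'x$. 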


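Proof. We need only prove $T^{-1/2}x'u \to N(0, c_2)$ because then the theorem 
follows from assumption A and Theorem 3.2.7 (iii) (Slutsky). We can write 

$$
\begin{aligned}
\frac{1}{\sqrt{T}}\sum_{t=1}^{T}x_tu_t
&= \frac{1}{\sqrt{T}}\sum_{t=1}^{T}x_t\sum_{j=0}^{N}\phi_j\epsilon_{t-j}
+ \frac{1}{\sqrt{T}}\sum_{t=1}^{T}x_t\sum_{j=N+1}^{\infty}\phi_j\epsilon_{t-j} \\
&\equiv A_1 + A_2.
\end{aligned}
\tag{6.3.1}
$$

But $V(A_2) \leq T^{-1}x'x\,\sigma^2\left(\sum_{j=N+1}^{\infty}|\phi_j|\right)^2$. Therefore $A_2$ can be ignored if one takes 
$N$ large enough. We have 

$$
\begin{aligned}
A_1 &= \frac{1}{\sqrt{T}}\sum_{t=1}^{T-N}\epsilon_t\sum_{j=0}^{N}\phi_jx_{t+j}
+ \frac{1}{\sqrt{T}}\sum_{t=1-N}^{0}\epsilon_t\sum_{j=1-t}^{N}\phi_jx_{t+j}
+ \frac{1}{\sqrt{T}}\sum_{t=T-N+1}^{T}\epsilon_t\sum_{j=0}^{T-t}\phi_jx_{t+j} \\
&\equiv A_{11} + A_{12} + A_{13}.
\end{aligned}
\tag{6.3.2}
$$

But $V(A_{12}) \leq T^{-1}\sigma^2NM^2\left(\sum_{j=0}^{\infty}|\phi_j|\right)^2$, where $M$ bounds the $|x_t|$ that enter $A_{12}$; 
this goes to 0 as $T \to \infty$ for a fixed $N$. 
The same is true for $A_{13}$. The theorem follows by noting that $\sum_{j=0}^{N}\phi_jx_{t+j}$ 
satisfies the condition for $x_t$ in Theorem 3.5.3. 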


6.3.2 Estimation of p 


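Because $\Sigma$ defined in (5.2.9) depends on $\sigma^2$ only through a scalar 
multiplication, $\hat\beta_G$ defined in (6.1.3) does not depend on $\sigma^2$. Therefore, in obtaining 
FGLS (6.2.1), we need to estimate only $\rho$. The most natural estimator of $\rho$ is 

$$
\hat\rho = \frac{\sum_{t=2}^{T}\hat u_{t-1}\hat u_t}{\sum_{t=2}^{T}\hat u_{t-1}^2}, \tag{6.3.3}
$$

where $\hat u_t = y_t - x_t'\hat\beta$. The consistency of $\hat\rho$ is straightforward. We shall prove its 
asymptotic normality. 
Using $u_t = \rho u_{t-1} + \epsilon_t$, we have 

$$
\sqrt{T}(\hat\rho - \rho) = \frac{\dfrac{1}{\sqrt{T}}\sum_{t=2}^{T}u_{t-1}\epsilon_t + \Delta_1}{\dfrac{1}{T}\sum_{t=2}^{T}u_{t-1}^2 + \Delta_2}, \tag{6.3.4}
$$

where 

$$
\begin{aligned}
\Delta_1 ={}& \frac{1}{\sqrt{T}}\sum_{t=2}^{T}(\hat\beta - \beta)'x_{t-1}x_t'(\hat\beta - \beta)
- \frac{1}{\sqrt{T}}\sum_{t=2}^{T}(\hat\beta - \beta)'x_{t-1}u_t \\
&- \frac{1}{\sqrt{T}}\sum_{t=2}^{T}(\hat\beta - \beta)'x_tu_{t-1}
- \frac{\rho}{\sqrt{T}}\sum_{t=2}^{T}[(\hat\beta - \beta)'x_{t-1}]^2 \\
&+ \frac{2\rho}{\sqrt{T}}\sum_{t=2}^{T}(\hat\beta - \beta)'x_{t-1}u_{t-1}
\end{aligned}
\tag{6.3.5}
$$

and 

$$
\Delta_2 = \frac{1}{T}\sum_{t=2}^{T}[(\hat\beta - \beta)'x_{t-1}]^2 - \frac{2}{T}\sum_{t=2}^{T}(\hat\beta - \beta)'x_{t-1}u_{t-1}. \tag{6.3.6}
$$

If we assume that $\lim_{T\to\infty} T^{-1}X'X$ is a finite nonsingular matrix, it is easy to 
show that both $\Delta_1$ and $\Delta_2$ converge to 0 in probability. For this, we need only 
the consistency of the LS estimator $\hat\beta$ and $\sqrt{T}(\hat\beta - \beta) = O(1)$ but not the 
asymptotic normality. Therefore, by repeated applications of Theorem 3.2.7 
(Slutsky), we conclude that $\sqrt{T}(\hat\rho - \rho)$ has the same limit distribution as 

$$
\sqrt{T}\left[\frac{\sum_{t=2}^{T}u_{t-1}u_t}{\sum_{t=2}^{T}u_{t-1}^2} - \rho\right]
= \frac{\dfrac{1}{\sqrt{T}}\sum_{t=2}^{T}u_{t-1}\epsilon_t}{\dfrac{1}{T}\sum_{t=2}^{T}u_{t-1}^2}.
$$

Hence the asymptotic normality of $\hat\rho$ follows from the result of Section 5.4. 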


6.3.3 Feasible Generalized Least Squares Estimator 


Inserting $\hat\rho$ defined in (6.3.3) into $\Sigma_1^{-1}$ defined in (5.2.14) and then inserting the 
estimate of $\Sigma^{-1}$ into (6.2.1), we obtain the FGLS estimator $\hat\beta_F$. As noted 
earlier, $\sigma^2$ drops out of the formula (6.2.1) and therefore need not be estimated 
in the calculation of $\hat\beta_F$. (However, $\sigma^2$ needs to be estimated in order to 
estimate the covariance matrix of $\hat\beta_F$.) The consistency of $\hat\beta_F$ can be proved 
easily under general conditions, but the proof of the asymptotic normality is 
rather involved. The reader is referred to Amemiya (1973a), where it has been 
proved that $\hat\beta_F$ has the same asymptotic distribution as $\hat\beta_G$ under general 
assumptions on $X$ when $\{u_t\}$ follow an autoregressive process with 
moving-average errors. More specifically, 

$$
\sqrt{T}(\hat\beta_F - \beta) \to N[0, \lim T(X'\Sigma^{-1}X)^{-1}]. \tag{6.3.7}
$$

In a finite sample, $\hat\beta_F$ may not be as efficient as $\hat\beta_G$. In fact, there is no 
assurance that $\hat\beta_F$ is better than the LS estimator $\hat\beta$. Harvey (1981a, p. 191) 
presented a summary of several Monte Carlo studies comparing $\hat\beta_F$ with $\hat\beta$. 
Taylor (1981) argued on the basis of analytic approximations of the moments 
of the estimators that the relative efficiency of $\hat\beta_F$ vis-à-vis $\hat\beta$ crucially depends 
on the process of the independent variables. 


6.3.4 A Useful Transformation for the Calculation of the Feasible 
Generalized Least Squares Estimator 


Let $\hat R_1$ be the matrix obtained by inserting $\hat\rho$ into the right-hand side of 
(5.2.10). Then we have by (5.2.14) $\hat\beta_F = (X'\hat R_1'\hat R_1X)^{-1}X'\hat R_1'\hat R_1y$. Thus $\hat\beta_F$ can 
be obtained by least squares after transforming all the variables by $\hat R_1$. 
Another, somewhat simpler transformation is defined by eliminating the first 
row of $\hat R_1$. This is called the Cochrane-Orcutt transformation (Cochrane and 
Orcutt, 1949). The resulting estimator is slightly different from FGLS but has 
the same asymptotic distribution. 
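
The two transformations can be sketched in a few lines. The code below is an illustration under stated assumptions, not a reproduction of (5.2.10): it assumes the usual form of the AR(1) transformation matrix, whose first row is $(\sqrt{1-\rho^2}, 0, \ldots, 0)$ and whose $t$th row ($t \geq 2$) has $-\rho$ in column $t-1$ and 1 in column $t$. The function names and the simulated data are hypothetical.

```python
import numpy as np

def rho_hat(resid):
    """Estimator (6.3.3): sum of u_{t-1} u_t over sum of u_{t-1}^2, t = 2..T."""
    return (resid[:-1] @ resid[1:]) / (resid[:-1] @ resid[:-1])

def r1_matrix(rho, T):
    """Assumed form of the AR(1) transformation; R'R is proportional to the inverse covariance."""
    R = np.eye(T) - rho * np.eye(T, k=-1)
    R[0, 0] = np.sqrt(1.0 - rho**2)
    return R

def ls(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fgls_ar1(X, y, drop_first=False):
    """FGLS by transformed LS; drop_first=True gives the Cochrane-Orcutt variant."""
    resid = y - X @ ls(X, y)
    rho = rho_hat(resid)
    R = r1_matrix(rho, len(y))
    if drop_first:
        R = R[1:]                      # eliminate the first row of the transformation
    return ls(R @ X, R @ y), rho

# Simulated data with AR(1) errors (hypothetical parameter values).
rng = np.random.default_rng(2)
T, rho_true = 200, 0.6
eps = rng.normal(size=T)
u = np.empty(T)
u[0] = eps[0] / np.sqrt(1 - rho_true**2)
for t in range(1, T):
    u[t] = rho_true * u[t - 1] + eps[t]
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + u

beta_f, rho_est = fgls_ar1(X, y)
beta_co, _ = fgls_ar1(X, y, drop_first=True)
```

Working with the transformed variables avoids forming or inverting the $T \times T$ covariance matrix, which is the practical appeal of the transformation.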

If $\{u_t\}$ follow AR(2) defined by (5.2.16), the relevant transformation is given 
by (5.2.27), where the $a$'s are determined by solving $V(a_1u_1) = \sigma^2$, 
$V(a_2u_1 + a_3u_2) = \sigma^2$, and $E[a_1u_1(a_2u_1 + a_3u_2)] = 0$. A generalization of the 
Cochrane-Orcutt transformation is obtained by eliminating the first two rows 
of $\hat R_2$. Higher-order autoregressive processes can be similarly handled. 


6.3.5 Maximum Likelihood Estimator 


Let us derive the MLE of $\beta$, $\rho$, and $\sigma^2$ in Model 6, assuming that $\{u_t\}$ are normal 
and follow AR(1). The log likelihood function of the model has the same 
appearance as (5.4.18) and is given by 

$$
\log L = -\frac{T}{2}\log 2\pi - \frac{T}{2}\log\sigma^2 + \frac{1}{2}\log(1 - \rho^2) - \frac{1}{2\sigma^2}Q, \tag{6.3.8}
$$

where $Q = (1 + \rho^2)\sum_{t=1}^{T}u_t^2 - \rho^2(u_1^2 + u_T^2) - 2\rho\sum_{t=2}^{T}u_{t-1}u_t$ and where we have 
written $u_t = y_t - x_t'\beta$ for short. Putting $\partial\log L/\partial\sigma^2 = 0$, we obtain 

$$
\sigma^2 = \frac{1}{T}Q. \tag{6.3.9}
$$

Inserting (6.3.9) into (6.3.8), we obtain the concentrated log likelihood 
function (aside from terms not depending on the unknown parameters) 

$$
\log L^* = -\frac{T}{2}\log Q + \frac{1}{2}\log(1 - \rho^2). \tag{6.3.10}
$$

For a given value of $\beta$, $\rho$ may be estimated either by the true MLE, which 
maximizes $\log L^*$ using the method of Beach and MacKinnon (1978), or by 
the approximate MLE, which minimizes $\log Q$, as was noted earlier in Section 
5.4. Both have the same asymptotic distribution. The formula for the 
approximate MLE is 

$$
\hat\rho = \frac{\sum_{t=2}^{T}(y_{t-1} - x_{t-1}'\beta)(y_t - x_t'\beta)}{\sum_{t=3}^{T}(y_{t-1} - x_{t-1}'\beta)^2}. \tag{6.3.11}
$$

Given $\rho$, the value of $\beta$ that maximizes $\log L^*$ is clearly 

$$
\hat\beta = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y, \tag{6.3.12}
$$

where $\Sigma$ is as given in (5.2.9). The approximate MLE of $\beta$, $\rho$, and $\sigma^2$ are 
obtained by solving (6.3.9), (6.3.11), and (6.3.12) simultaneously. 

These equations defining the approximate MLE are highly nonlinear, so 
that the MLE must be obtained either by a search method or by an iteration. 
We have already described the common iterative procedure of going back and 
forth between (6.3.11) and (6.3.12), starting from the LS $\hat\beta$. 
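
The back-and-forth iteration can be sketched as follows. This is an illustrative implementation under stated assumptions: $Q$ is written as the quadratic form $u'K(\rho)u$ with $K(\rho)$ the tridiagonal matrix implied by its definition, the step for $\rho$ uses the exact minimizer of $Q$ for given $\beta$, and the function names, stopping rule, and simulated data are hypothetical.

```python
import numpy as np

def q_matrix(rho, T):
    """Tridiagonal K(rho) with Q = u'K u: diagonal (1, 1+rho^2, ..., 1+rho^2, 1), off-diagonal -rho."""
    K = (1 + rho**2) * np.eye(T) - rho * (np.eye(T, k=1) + np.eye(T, k=-1))
    K[0, 0] = K[-1, -1] = 1.0
    return K

def approx_mle_ar1(X, y, tol=1e-10, max_iter=200):
    T = len(y)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]                # start from LS
    q_path = []
    for _ in range(max_iter):
        u = y - X @ beta
        rho = (u[:-1] @ u[1:]) / (u[1:-1] @ u[1:-1])           # minimizes Q for given beta
        K = q_matrix(rho, T)
        beta_new = np.linalg.solve(X.T @ K @ X, X.T @ K @ y)   # minimizes Q for given rho
        u_new = y - X @ beta_new
        q_path.append(u_new @ K @ u_new)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    sigma2 = q_path[-1] / T                                    # (6.3.9)
    return beta, rho, sigma2, q_path

# Simulated data with AR(1) errors (hypothetical parameter values).
rng = np.random.default_rng(3)
T, rho_true = 200, 0.6
eps = rng.normal(size=T)
u = np.empty(T)
u[0] = eps[0] / np.sqrt(1 - rho_true**2)
for t in range(1, T):
    u[t] = rho_true * u[t - 1] + eps[t]
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + u

beta_ml, rho_ml, sigma2_ml, q_path = approx_mle_ar1(X, y)
```

Each half-step minimizes $Q$ in one block of parameters with the other held fixed, so the recorded values of $Q$ cannot increase from one iteration to the next.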

For the asymptotic normality of MLE in Model 6 where $\{u_t\}$ follow a 
stationary autoregressive, moving-average process, see Pierce (1971). 


6.3.6 Durbin-Watson Test 


In this subsection we shall consider the test of the hypothesis $\rho = 0$ in Model 6 
with $\{u_t\}$ following AR(1). The Durbin-Watson test statistic, proposed by 
Durbin and Watson (1950, 1951), is defined by 

$$
d = \frac{\sum_{t=2}^{T}(\hat u_t - \hat u_{t-1})^2}{\sum_{t=1}^{T}\hat u_t^2}, \tag{6.3.13}
$$

where $\{\hat u_t\}$ are the least squares residuals. By comparing (6.3.13) with (6.3.3), 
we can show that 

$$
d = 2 - 2\hat\rho + O(T^{-1}). \tag{6.3.14}
$$

From this we know that plim $d = 2 - 2\rho$, and the asymptotic distribution of $d$ 
follows easily from the asymptotic distribution of $\hat\rho$ derived in Section 6.3.2. 
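
A short numerical check of (6.3.13) and (6.3.14), assuming NumPy and simulated AR(1) errors (the sample size and parameter values are hypothetical):

```python
import numpy as np

def durbin_watson(resid):
    """Statistic (6.3.13) computed from a vector of LS residuals."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid**2)

def rho_hat(resid):
    """Estimator (6.3.3)."""
    return (resid[:-1] @ resid[1:]) / (resid[:-1] @ resid[:-1])

rng = np.random.default_rng(3)
T, rho_true = 2000, 0.5
eps = rng.normal(size=T)
u = np.empty(T)
u[0] = eps[0] / np.sqrt(1 - rho_true**2)
for t in range(1, T):
    u[t] = rho_true * u[t - 1] + eps[t]
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([0.5, 1.0]) + u
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

d = durbin_watson(resid)
gap = d - (2 - 2 * rho_hat(resid))    # the remainder term in (6.3.14)
```

The remainder consists of end effects involving only the first and last residuals, which is why it vanishes at rate $T^{-1}$.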

To derive the exact distribution of $d$ under the null hypothesis, it is useful to 
rewrite (6.3.13) as 

$$
d = \frac{u'MAMu}{u'Mu}, \tag{6.3.15}
$$

where $M = I - X(X'X)^{-1}X'$ and $A$ is a $T \times T$ symmetric matrix defined by 

$$
A = \begin{bmatrix}
1 & -1 & 0 & \cdots & & 0 \\
-1 & 2 & -1 & 0 & \cdots & 0 \\
 & \ddots & \ddots & \ddots & & \\
0 & \cdots & 0 & -1 & 2 & -1 \\
0 & \cdots & & 0 & -1 & 1
\end{bmatrix}.
\tag{6.3.16}
$$

Because $MAM$ commutes with $M$, there exists an orthogonal matrix $H$ that 
diagonalizes both (see Theorem 8 of Appendix 1). Let $\nu_1, \nu_2, \ldots, \nu_{T-K}$ be the 
nonzero characteristic roots of $MAM$. Then we can rewrite (6.3.15) as 

$$
d = \frac{\sum_{i=1}^{T-K}\nu_i\zeta_i^2}{\sum_{i=1}^{T-K}\zeta_i^2}, \tag{6.3.17}
$$

where $\{\zeta_i\}$ are i.i.d. $N(0, 1)$. In the form of (6.3.17), the density of $d$ can be 
represented as a definite integral (Plackett, 1960, p. 26). The significance 
points of $d$ can be computed either by the direct evaluation of the integral 
(Imhof, 1961) or by approximating the density on the basis of the first few 
moments. Durbin and Watson (1950, 1951) chose the latter method and 
suggested two variants: the Beta approximation based on the first two 
moments and the Jacobi approximation based on the first four moments. For a 
good discussion of many other methods for evaluating the significance points 
of $d$, see Durbin and Watson (1971). 

The problem of finding the moments of $d$ is simplified because of the 
independence of $d$ and the denominator of $d$ (see Theorem 7 of Appendix 2), 
which implies 

$$
E(d^s) = \frac{E\left(\sum_{i=1}^{T-K}\nu_i\zeta_i^2\right)^s}{E\left(\sum_{i=1}^{T-K}\zeta_i^2\right)^s} \tag{6.3.18}
$$

for any positive integer $s$. Thus, for example, 

$$
Ed = \frac{1}{T-K}\sum_{i=1}^{T-K}\nu_i \tag{6.3.19}
$$

and 

$$
Vd = \frac{2\sum_{i=1}^{T-K}(\nu_i - Ed)^2}{(T-K)(T-K+2)}, \tag{6.3.20}
$$

where the term involving $\nu_i^2$ can be evaluated using 

$$
\sum_{i=1}^{T-K}\nu_i^2 = \operatorname{tr}(MA)^2. \tag{6.3.21}
$$

The Beta approximation assumes that $x = d/4$ has a density given by 

$$
\frac{1}{B(p, q)}x^{p-1}(1 - x)^{q-1}, \quad 0 \leq x \leq 1; \qquad 0 \text{ otherwise}, \tag{6.3.22}
$$

so that $Ed = 4p/(p + q)$ and $Vd = 16pq/[(p + q)^2(p + q + 1)]$. The Jacobi 
approximation assumes that $x = d/4$ has a density given by 

$$
\frac{x^{q-1}(1 - x)^{p-q}}{B(q, p - q + 1)}[1 + a_3G_3(x) + a_4G_4(x)], \tag{6.3.23}
$$

where $G_3$ and $G_4$ are the third- and fourth-order Jacobi polynomials as defined 
by Abramowitz and Stegun (1965, Definition 22.2.2). The parameters $p$ and $q$ 
are determined so that the first two moments of (6.3.23) are equated to those 
of $d$, and $a_3$ and $a_4$ are determined so that the third and fourth moments 
match. In either approximation method the significance points can be 
computed by a straightforward computer program for integration. 

The distribution of $d$ depends upon $X$. That means investigators must 
calculate the significance points for each of their problems by any of the 
approximation methods discussed in the previous section. Such 
computations are often expensive. To make such a computation unnecessary in certain 
cases, Durbin and Watson (1950, 1951) obtained upper and lower bounds of $d$ 
that do not depend on $X$ and tabulated the significance points for these. The 
bounds for $d$ are given by 


$$
d_L = \frac{\sum_{i=1}^{T-K}\lambda_i\zeta_i^2}{\sum_{i=1}^{T-K}\zeta_i^2}
\quad\text{and}\quad
d_U = \frac{\sum_{i=1}^{T-K}\lambda_{i+K-1}\zeta_i^2}{\sum_{i=1}^{T-K}\zeta_i^2}, \tag{6.3.24}
$$

where $\lambda_i = 2(1 - \cos i\pi T^{-1})$ are the positive characteristic roots of $A$. (The 
remaining characteristic root of $A$ is $\lambda_0 = 0$.) These bounds were obtained 
under the assumption that $X$ contains a vector of ones and that no other 
column of $X$ is a characteristic vector of $A$.² Durbin and Watson (1951) 
calculated the significance points (at the 1, 2.5, and 5% levels) of $d_L$ and $d_U$ for 
$K = 1, 2, \ldots, 6$ and $T = 15, 16, \ldots, 100$ by the Jacobi approximation. 
Let $d_{L\alpha}$ and $d_{U\alpha}$ be the critical values of these bounds at the $\alpha$% significance 
level. Then test $H_0$: $\rho = 0$ against $H_1$: $\rho > 0$ by the procedure: 

$$
\begin{aligned}
&\text{Reject } H_0 \text{ if } d \leq d_{L\alpha}, \\
&\text{Accept } H_0 \text{ if } d \geq d_{U\alpha}.
\end{aligned}
\tag{6.3.25}
$$

If $d_{L\alpha} < d < d_{U\alpha}$, the test is inconclusive. To test $H_0$: $\rho = 0$ against $H_1$: $\rho < 0$, 
Durbin and Watson suggested the following procedure: 

$$
\begin{aligned}
&\text{Reject } H_0 \text{ if } d \geq 4 - d_{L\alpha}, \\
&\text{Accept } H_0 \text{ if } d \leq 4 - d_{U\alpha}.
\end{aligned}
\tag{6.3.26}
$$

If these tests are inconclusive, the significance points of (6.3.17) must be 
evaluated directly. 
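
The quantities used in this subsection are easy to compute. The following sketch (assuming NumPy; the regressor matrix is simulated and the function names are hypothetical) builds the matrix $A$, evaluates the null mean and variance of $d$ from the trace formulas, checks the closed form for the characteristic roots of $A$, and verifies that each nonzero root of $MAM$ lies between the corresponding roots of $A$ that define $d_L$ and $d_U$. The decision function takes the critical values as inputs; they must come from the published tables and are not computed here.

```python
import numpy as np

def dw_matrix(T):
    """Matrix A of (6.3.16): diagonal (1, 2, ..., 2, 1), off-diagonal -1."""
    A = 2 * np.eye(T) - np.eye(T, k=1) - np.eye(T, k=-1)
    A[0, 0] = A[-1, -1] = 1.0
    return A

def dw_null_moments(X):
    """Mean and variance of d under rho = 0 and normal errors, (6.3.19)-(6.3.21)."""
    T, K = X.shape
    n = T - K
    M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)
    MA = M @ dw_matrix(T)
    mean = np.trace(MA) / n                                   # sum of nu_i over T - K
    var = 2 * (np.trace(MA @ MA) - n * mean**2) / (n * (n + 2))
    return mean, var

def dw_decision(d, d_lower, d_upper, alternative="positive"):
    """Bounds test (6.3.25) and (6.3.26); d_lower and d_upper are tabulated critical values."""
    if alternative == "negative":
        d = 4.0 - d            # (6.3.26) is (6.3.25) applied to 4 - d
    if d <= d_lower:
        return "reject"
    if d >= d_upper:
        return "accept"
    return "inconclusive"

rng = np.random.default_rng(4)
T, K = 30, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])   # contains a vector of ones
A = dw_matrix(T)
M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)

lam = 2 * (1 - np.cos(np.arange(T) * np.pi / T))   # lam[0] = 0, lam[1:] the positive roots
nu = np.linalg.eigvalsh(M @ A @ M)[K:]             # the T - K nonzero roots, ascending
lower = lam[1:T - K + 1]                           # lambda_i,       i = 1, ..., T-K
upper = lam[K:T]                                   # lambda_{i+K-1}, i = 1, ..., T-K
mean_d, var_d = dw_null_moments(X)
```

A call such as `dw_decision(d, 1.2, 1.6)` uses placeholder critical values for illustration only.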


6.3.7 Joint Presence of Lagged Endogenous Variables and 
Serial Correlation 


In this last subsection we shall depart from Model 6 and consider briefly a 
problem that arises when $X$ in (6.1.1) contains lagged values of $y$. An example 
of such a model is the geometric distributed-lag model (5.6.4), where the errors 
are serially correlated. This is an important problem because this situation 
often occurs in practice and many of the results that have been obtained up to 
this point in Section 6.3 are invalidated in the presence of lagged endogenous 
variables. 

Consider model (5.6.4) and suppose $\{u_t\}$ follow AR(1), $u_t = \rho u_{t-1} + \epsilon_t$, 
where $\{\epsilon_t\}$ are i.i.d. with $E\epsilon_t = 0$ and $V\epsilon_t = \sigma^2$. The LS estimators of $\lambda$ and $\alpha$ are 
clearly inconsistent because $\operatorname{plim}_{T\to\infty} T^{-1}\sum_{t=1}^{T}y_{t-1}u_t \neq 0$. The GLS estimators 
of $\lambda$ and $\alpha$ based on the true value of $\rho$ possess not only consistency but also all 
the good asymptotic properties because the transformation $R_1$ defined in 
(5.2.10) essentially reduces the model to the one with independent errors. 
However, it is interesting to note that FGLS, although still consistent, does not 
have the same asymptotic distribution as GLS, as we shall show. 

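Write (5.6.4) in vector notation as 

$$
y = \lambda y_{-1} + \alpha x + u \equiv Z\gamma + u, \tag{6.3.27}
$$

where $y_{-1} = (y_0, y_1, \ldots, y_{T-1})'$. Suppose, in defining FGLS, we use a 
consistent estimator of $\rho$, denoted $\hat\rho$, such that $\sqrt{T}(\hat\rho - \rho)$ converges to a 
nondegenerate normal variable. Then the FGLS estimator of $\gamma$, denoted $\hat\gamma_F$, is 
given by 

$$
\hat\gamma_F = (Z'\hat R_1'\hat R_1Z)^{-1}Z'\hat R_1'\hat R_1y, \tag{6.3.28}
$$

where $\hat R_1$ is derived from $R_1$ by replacing $\rho$ with $\hat\rho$. The asymptotic distribution 
of $\hat\gamma_F$ can be derived from the following equations, in which $\cong$ relates 
quantities that have the same limit distribution: 

$$
\begin{aligned}
\sqrt{T}(\hat\gamma_F - \gamma) &= (T^{-1}Z'\hat R_1'\hat R_1Z)^{-1}T^{-1/2}Z'\hat R_1'\hat R_1u \\
&\cong (\operatorname{plim} T^{-1}Z'R_1'R_1Z)^{-1} \\
&\quad\times [T^{-1/2}Z'R_1'R_1u + (T^{-1/2}Z'\hat R_1'\hat R_1u - T^{-1/2}Z'R_1'R_1u)].
\end{aligned}
\tag{6.3.29}
$$

Let $w_T$ be the first element of the two-dimensional vector $(T^{-1/2}Z'\hat R_1'\hat R_1u - 
T^{-1/2}Z'R_1'R_1u)$. Then we have 

$$
\begin{aligned}
w_T &= \sqrt{T}(\hat\rho^2 - \rho^2)\,\frac{1}{T}\sum_{t=1}^{T-2}y_tu_{t+1}
- \sqrt{T}(\hat\rho - \rho)\,\frac{1}{T}\left(\sum_{t=1}^{T-1}y_tu_t + \sum_{t=0}^{T-2}y_tu_{t+2}\right) \\
&\cong -\sqrt{T}(\hat\rho - \rho)\,\frac{\sigma^2}{1 - \lambda\rho},
\end{aligned}
\tag{6.3.30}
$$

where the last line follows by replacing the sample averages with their probability limits. 

Thus we conclude that the asymptotic distribution of FGLS differs from that 
of GLS and depends on the asymptotic distribution of $\hat\rho$ (see Amemiya and 
Fuller, 1967, for further discussion). Amemiya and Fuller showed how to 
obtain an asymptotically efficient feasible estimator. Such an estimator is not 
as efficient as GLS. 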

The theory of the test of independence discussed in the preceding 
subsection must also be modified under the present model. If $X$ contained lagged 
dependent variables, $y_{t-1}, y_{t-2}, \ldots$, we will still have Eq. (6.3.17) formally, 
but $\{\zeta_i\}$ will no longer be independent normal because $H$ will be a random 
matrix correlated with $u$. Therefore the Durbin-Watson bounds will no longer 
be valid. 

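Even the asymptotic distribution of $d$ under the null hypothesis of 
independence is different in this case from the case in which $X$ is purely nonstochastic. 
The asymptotic distribution of $d$ is determined by the asymptotic distribution 
of $\hat\rho$ because of (6.3.14). When $X$ is nonstochastic, we have $\sqrt{T}(\hat\rho - \rho) \to 
N(0, 1 - \rho^2)$ by the results of Section 6.3.2. But, if $X$ contains the lagged 
dependent variables, the asymptotic distribution of $\hat\rho$ will be different. This 
can be seen by looking at the formula for $\sqrt{T}(\hat\rho - \rho)$ in Eq. (6.3.4). The third 
term, for example, of the right-hand side of (6.3.5), which is 
$T^{-1/2}\sum_{t=2}^{T}(\hat\beta - \beta)'x_tu_{t-1}$, does not converge to 0 in probability because $x_t$ and 
$u_{t-1}$ are correlated. Therefore the conclusion obtained there does not hold. 

We consider the asymptotic distribution of $\hat\rho$ in the simplest case 

$$
y_t = \alpha y_{t-1} + u_t, \qquad u_t = \rho u_{t-1} + \epsilon_t, \tag{6.3.31}
$$

where $|\alpha|, |\rho| < 1$, $\{\epsilon_t\}$ are i.i.d. with $E\epsilon_t = 0$ and $E\epsilon_t^2 = \sigma^2$, and both $\{y_t\}$ and 
$\{u_t\}$ are stationary. Define 

$$
\hat\rho = \frac{\sum_{t=2}^{T}\hat u_{t-1}\hat u_t}{\sum_{t=2}^{T}\hat u_{t-1}^2}, \tag{6.3.32}
$$

where $\hat u_t = y_t - \hat\alpha y_{t-1}$ and $\hat\alpha$ is the least squares estimator. Consider the limit 
distribution of $\sqrt{T}\hat\rho$ under the assumption $\rho = 0$. Because the denominator 
times $T^{-1}$ converges to $\sigma^2$ in probability, we have asymptotically 

$$
\begin{aligned}
\sqrt{T}\hat\rho &\cong \frac{1}{\sigma^2}\frac{1}{\sqrt{T}}\sum_{t=2}^{T}\hat u_{t-1}\hat u_t \\
&\cong \frac{1}{\sigma^2}\frac{1}{\sqrt{T}}\sum_{t=2}^{T}[u_{t-1}u_t - (1 - \alpha^2)y_{t-1}u_t].
\end{aligned}
\tag{6.3.33}
$$

Therefore the asymptotic variance (denoted AV) of $\sqrt{T}\hat\rho$ is given by 

$$
\begin{aligned}
\mathrm{AV}(\sqrt{T}\hat\rho) &= \frac{1}{\sigma^4}\frac{1}{T}V\left\{\sum_{t=2}^{T}[u_{t-1}u_t - (1 - \alpha^2)y_{t-1}u_t]\right\} \\
&= \frac{1}{\sigma^4T}\left[E\left(\sum_{t=2}^{T}u_{t-1}u_t\right)^2 + (1 - \alpha^2)^2E\left(\sum_{t=2}^{T}y_{t-1}u_t\right)^2\right. \\
&\qquad\qquad\left. - 2(1 - \alpha^2)E\left(\sum_{t=2}^{T}u_{t-1}u_t\sum_{t=2}^{T}y_{t-1}u_t\right)\right] \\
&= \alpha^2(1 - T^{-1}).
\end{aligned}
\tag{6.3.34}
$$

Hence, assuming that the asymptotic normality holds, we have 

$$
\sqrt{T}\hat\rho \to N(0, \alpha^2). \tag{6.3.35}
$$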
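
A small Monte Carlo experiment illustrates (6.3.35). The sketch below assumes NumPy and normal innovations with unit variance; the values of $\alpha$, the sample size, the number of replications, and the burn-in length are hypothetical choices. It simulates model (6.3.31) with $\rho = 0$, computes $\hat\alpha$ and $\hat\rho$ in each replication, and estimates the variance of $\sqrt{T}\hat\rho$, which should be near $\alpha^2$ rather than near 1, the value that applies when the regressors are nonstochastic.

```python
import numpy as np

def simulate_sqrtT_rho(alpha, T, reps, rng, burn=100):
    """Draws of sqrt(T) * rho_hat from y_t = alpha * y_{t-1} + u_t with i.i.d. u_t (rho = 0)."""
    u = rng.normal(size=(reps, T + burn))
    y = np.zeros((reps, T + burn))
    for t in range(1, T + burn):
        y[:, t] = alpha * y[:, t - 1] + u[:, t]
    y = y[:, burn:]                                   # discard the start-up values
    y_lag, y_cur = y[:, :-1], y[:, 1:]
    alpha_hat = np.sum(y_lag * y_cur, axis=1) / np.sum(y_lag**2, axis=1)
    resid = y_cur - alpha_hat[:, None] * y_lag        # LS residuals
    rho = (np.sum(resid[:, :-1] * resid[:, 1:], axis=1)
           / np.sum(resid[:, :-1] ** 2, axis=1))      # (6.3.32)
    return np.sqrt(resid.shape[1]) * rho

rng = np.random.default_rng(5)
alpha = 0.5
draws = simulate_sqrtT_rho(alpha, T=400, reps=2000, rng=rng)
var_mc = draws.var()                                  # compare with alpha**2 = 0.25
```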


Durbin (1970) obtained the following more general result: Even if 
higher-order lagged values of $y_t$ and purely exogenous variables are contained among 
the regressors, we have under the assumption $\rho = 0$ 

$$
\sqrt{T}\hat\rho \to N[0, 1 - \mathrm{AV}(\sqrt{T}\hat\alpha_1)], \tag{6.3.36}
$$

where $\hat\alpha_1$ is the least squares estimate of the coefficient on $y_{t-1}$. He proposed 
that the test of the independence be based on the asymptotic normality 
above.³ 


6.4 Seemingly Unrelated Regression Model 


The seemingly unrelated regression (SUR) model proposed by Zellner (1962) 
consists of the following $N$ regression equations, each of which satisfies the 
assumptions of the standard regression model (Model 1): 

$$
y_i = X_i\beta_i + u_i, \qquad i = 1, 2, \ldots, N, \tag{6.4.1}
$$

where $y_i$ and $u_i$ are $T$-vectors, $X_i$ is a $T \times K_i$ matrix, and $\beta_i$ is a $K_i$-vector. Let $u_{it}$ 
be the $t$th element of the vector $u_i$. Then we assume that $(u_{1t}, u_{2t}, \ldots, u_{Nt})$ is 
an i.i.d. random vector with $Eu_{it} = 0$ and Cov $(u_{it}, u_{jt}) = \sigma_{ij}$. Defining $y = 
(y_1', y_2', \ldots, y_N')'$, $\beta = (\beta_1', \beta_2', \ldots, \beta_N')'$, $u = (u_1', u_2', \ldots, u_N')'$, and 
$X = \operatorname{diag}(X_1, X_2, \ldots, X_N)$, we can write (6.4.1) as 

$$
y = X\beta + u. \tag{6.4.2}
$$

This is clearly a special case of Model 6, where the covariance matrix of $u$ is 
given by 

$$
Euu' \equiv \Omega = \Sigma \otimes I_T, \tag{6.4.3}
$$

where $\Sigma = \{\sigma_{ij}\}$ and $\otimes$ denotes the Kronecker product (see Theorem 22 of 
Appendix 1). 

This model is useful not only for its own sake but also because it reduces to a 
certain kind of heteroscedastic model if $\Sigma$ is diagonal (a model that will be 
discussed in Section 6.5) and because it can be shown to be equivalent to a 
certain error components model (which will be discussed in Section 6.6). 

The GLS estimator of $\beta$ is defined by $\hat\beta_G = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$. Because of 
(6.4.3) we have $\Omega^{-1} = \Sigma^{-1} \otimes I$, using Theorem 22 (i) of Appendix 1. In the 
special case in which $X_1 = X_2 = \cdots = X_N$, we can show $\hat\beta_G = \hat\beta$ as follows: 
Denoting the common value of $X_i$ by $\bar X$ and using Theorem 22 (i) of Appendix 
1 repeatedly, we have 

$$
\begin{aligned}
\hat\beta_G &= [X'(\Sigma^{-1} \otimes I)X]^{-1}X'(\Sigma^{-1} \otimes I)y \\
&= [(I \otimes \bar X')(\Sigma^{-1} \otimes I)(I \otimes \bar X)]^{-1}(I \otimes \bar X')(\Sigma^{-1} \otimes I)y \\
&= (\Sigma^{-1} \otimes \bar X'\bar X)^{-1}(\Sigma^{-1} \otimes \bar X')y \\
&= [I \otimes (\bar X'\bar X)^{-1}\bar X']y \\
&= (X'X)^{-1}X'y.
\end{aligned}
\tag{6.4.4}
$$

The same result can also be obtained by using Theorem 6.1.1. Statement E is 
especially easy to verify for the present problem. 

To define FGLS, we must first estimate $\Sigma$. A natural consistent estimator of 
its $i,j$th element is provided by $\hat\sigma_{ij} = T^{-1}\hat u_i'\hat u_j$, where $\hat u_i = y_i - X_i\hat\beta_i$ are the 
least squares residuals from the $i$th equation. These estimates are clearly 
consistent as $T$ goes to $\infty$ while $N$ is fixed. Because of the special form of $\Omega^{-1}$ 
given in the preceding paragraph, it is quite straightforward to prove that 
FGLS and GLS have the same asymptotic distribution (as $T \to \infty$) under 
general assumptions on $u$ and $X$. The limit distribution of $\sqrt{T}(\hat\beta_G - \beta)$ or 
$\sqrt{T}(\hat\beta_F - \beta)$ is $N[0, (\lim T^{-1}X'\Omega^{-1}X)^{-1}]$. Suitable assumptions on $u$ and $X$ 
can easily be inferred from Theorem 3.5.4. 
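
The estimators of this section can be sketched with Kronecker products. The code below is an illustration assuming NumPy; the helper names, the two-equation design, and the parameter values are hypothetical. It forms the stacked system, computes GLS for a given error covariance matrix, computes FGLS from the equation-by-equation LS residuals, and confirms that GLS reduces to LS when every equation has the same regressors.

```python
import numpy as np

def ls(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def block_diag(blocks):
    """X = diag(X_1, ..., X_N) as a dense array."""
    out = np.zeros((sum(b.shape[0] for b in blocks), sum(b.shape[1] for b in blocks)))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

def sur_gls(X_list, y_list, Sigma):
    """GLS for the stacked system with error covariance Sigma (x) I_T."""
    T = len(y_list[0])
    X = block_diag(X_list)
    y = np.concatenate(y_list)
    Omega_inv = np.kron(np.linalg.inv(Sigma), np.eye(T))
    return np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

def sur_fgls(X_list, y_list):
    """FGLS with sigma_hat_ij = T^{-1} u_i' u_j from equation-by-equation LS residuals."""
    T = len(y_list[0])
    resid = np.column_stack([y - X @ ls(X, y) for X, y in zip(X_list, y_list)])
    Sigma_hat = resid.T @ resid / T
    return sur_gls(X_list, y_list, Sigma_hat), Sigma_hat

rng = np.random.default_rng(5)
T = 50
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])                # hypothetical error covariance
U = rng.normal(size=(T, 2)) @ np.linalg.cholesky(Sigma).T  # rows have covariance Sigma
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])
X2 = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y1 = X1 @ np.array([1.0, 2.0]) + U[:, 0]
y2 = X2 @ np.array([0.0, -1.0, 0.5]) + U[:, 1]
y2_same = X1 @ np.array([0.0, -1.0]) + U[:, 1]             # second equation with the same regressors

beta_fgls, Sigma_hat = sur_fgls([X1, X2], [y1, y2])
beta_same = sur_gls([X1, X1], [y1, y2_same], Sigma)
beta_ls_same = np.concatenate([ls(X1, y1), ls(X1, y2_same)])
```

Forming the $NT \times NT$ matrix explicitly is acceptable for small systems; larger problems would exploit the Kronecker structure instead.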

FGLS is generally unbiased provided that it possesses a mean, as proved in a 
simple, elegant theorem by Kakwani (1967). The exact covariance matrix of 
FGLS in simple situations has been obtained by several authors and compared 
with that of GLS or LS (surveyed by Srivastava and Dwivedi, 1979). A 
particularly interesting result is attributable to Kariya (1981), who obtained the 
following inequalities concerning the covariance matrices of GLS and FGLS 
in a two-equation model with normal errors: 

$$
V\hat\beta_G \leq V\hat\beta_F \leq \left[1 + \frac{2}{T - r - 3}\right]V\hat\beta_G, \tag{6.4.5}
$$

where $r = \operatorname{rank}[X_1, X_2]$. 


6.5 Heteroscedasticity 


A heteroscedastic regression model is Model 6 where $\Sigma$ is a diagonal matrix, 
the diagonal elements of which assume at least two different values. 
Heteroscedasticity is a common occurrence in econometric applications and can 
often be detected by plotting the least squares residuals against time, the 
dependent variable, the regression mean, or any other linear combination of 
the independent variables. For example, Prais and Houthakker (1955) found 
that the variability of the residuals from a regression of food expenditure on 
income increases with income. In the subsequent subsections we shall 
consider various ways to parameterize the heteroscedasticity. We shall consider 
the estimation of the heteroscedasticity parameters as well as the regression 
coefficients. We shall also discuss tests for heteroscedasticity. 


6.5.1 Unrestricted Heteroscedasticity 


When heteroscedasticity is unrestricted, the heteroscedasticity is not 
parameterized. So we shall treat each of the $T$ variances $\{\sigma_t^2\}$ as an unknown 
parameter. Clearly, we cannot consistently estimate these variances because we have 
but one observation per variance. Nevertheless, it is possible to estimate the 
regression coefficients $\beta$ consistently and more efficiently than by LS and to 
test for heteroscedasticity. 

The idea of obtaining estimates of the regression coefficients that are more 
efficient than LS in an unrestricted heteroscedastic regression model has been 
developed independently by White (1982) and by Chamberlain (1982), 
following the idea of Eicker (1963), who suggested an estimator of $V\hat\beta$ (6.1.5) that 
does not require the consistent estimation of $\Sigma$. The following discussion 
follows Amemiya (1983c). 

Let $W$ be a $T \times (T - K)$ matrix of constants such that $[X, W]$ is nonsingular 
and $W'X = 0$. Then, premultiplying (6.1.1) by $[X, W]'$, GLS estimation of 
$\beta$ can be interpreted as GLS applied simultaneously to 

$$
X'y = X'X\beta + X'u \tag{6.5.1}
$$

and 

$$
W'y = W'u. \tag{6.5.2}
$$

Thus, applying Theorem 13 of Appendix 1 to 

$$
\begin{bmatrix}
X'\Sigma X & X'\Sigma W \\
W'\Sigma X & W'\Sigma W
\end{bmatrix}^{-1},
$$

we obtain 

$$
\hat\beta_G = \hat\beta - (X'X)^{-1}X'\Sigma W(W'\Sigma W)^{-1}W'y. \tag{6.5.3}
$$

Of course, it is also possible to derive (6.5.3) directly from (6.1.3) without 
regard to the interpretation given above. An advantage of (6.5.3) over (6.1.3) is 
that the former does not depend on $\Sigma^{-1}$. Note that one cannot estimate 
$T^{-1}X'\Sigma^{-1}X$ consistently unless $\Sigma$ can be consistently estimated. To 
transform (6.5.3) into a feasible estimator, one is tempted to replace $\Sigma$ by a 
diagonal matrix $D$ whose $t$th element is $(y_t - x_t'\hat\beta)^2$. Then it is easy to prove 
that under general assumptions plim $T^{-1}X'DW$ = plim $T^{-1}X'\Sigma W$ and 
plim $T^{-1}W'DW$ = plim $T^{-1}W'\Sigma W$ element by element. However, one 
difficulty remains: Because the size of these matrices increases with $T$, the 
resulting feasible estimator is not asymptotically equivalent to GLS. 

We can solve this problem partially by replacing (6.5.2) with 

$$
W_1'y = W_1'u, \tag{6.5.4}
$$

where $W_1$ consists of $N$ columns of $W$, $N$ being a fixed number. When GLS is 
applied to (6.5.1) and (6.5.4), it is called the partially generalized least squares 
(PGLS) and the estimator is given by 

$$
\hat\beta_P = \hat\beta - (X'X)^{-1}X'\Sigma W_1(W_1'\Sigma W_1)^{-1}W_1'y. \tag{6.5.5}
$$

PGLS is more efficient than LS because 

$$
V\hat\beta - V\hat\beta_P = (X'X)^{-1}X'\Sigma W_1(W_1'\Sigma W_1)^{-1}W_1'\Sigma X(X'X)^{-1}, \tag{6.5.6}
$$

which is clearly nonnegative definite, but it is less efficient than GLS. An 
asymptotically equivalent feasible version of $\hat\beta_P$ is obtained by replacing the $\Sigma$ 
in (6.5.5) by the $D$ defined in the preceding paragraph. 
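
Formulas (6.5.3), (6.5.5), and (6.5.6) can be checked numerically. The sketch below assumes NumPy and a known diagonal covariance matrix with hypothetical variances. It takes $W$ from a complete QR decomposition of $X$ (one convenient way to obtain columns orthogonal to $X$, not the only one), verifies that (6.5.3) reproduces GLS, and computes the PGLS covariance matrix so that the ordering LS, PGLS, GLS can be inspected.

```python
import numpy as np

rng = np.random.default_rng(6)
T, K, N = 40, 2, 5
X = np.column_stack([np.ones(T), rng.normal(size=T)])
sig2 = np.exp(rng.normal(size=T))               # hypothetical variances sigma_t^2
Sigma = np.diag(sig2)
y = X @ np.array([1.0, -1.0]) + np.sqrt(sig2) * rng.normal(size=T)

Q, _ = np.linalg.qr(X, mode="complete")
W = Q[:, K:]                                    # T x (T-K) with W'X = 0 and [X, W] nonsingular
W1 = W[:, :N]                                   # N columns of W
XtX_inv = np.linalg.inv(X.T @ X)
beta_ls = XtX_inv @ X.T @ y

def corrected_ls(Wm):
    """beta_ls - (X'X)^{-1} X' Sigma Wm (Wm' Sigma Wm)^{-1} Wm' y."""
    adj = np.linalg.solve(Wm.T @ Sigma @ Wm, Wm.T @ y)
    return beta_ls - XtX_inv @ X.T @ Sigma @ Wm @ adj

beta_g_653 = corrected_ls(W)                    # (6.5.3)
beta_p = corrected_ls(W1)                       # (6.5.5)

Si = np.linalg.inv(Sigma)
V_g = np.linalg.inv(X.T @ Si @ X)               # (6.1.4)
beta_g = V_g @ X.T @ Si @ y                     # (6.1.3)
V_ls = XtX_inv @ X.T @ Sigma @ X @ XtX_inv      # (6.1.5)
gain = XtX_inv @ X.T @ Sigma @ W1 @ np.linalg.solve(W1.T @ Sigma @ W1, W1.T @ Sigma @ X) @ XtX_inv
V_p = V_ls - gain                               # covariance of PGLS implied by (6.5.6)
```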

White (1980a) proposed testing the hypothesis $\sigma_t^2 = \sigma^2$ for all $t$ by 
comparing $(X'X)^{-1}X'DX(X'X)^{-1}$ with $\hat\sigma^2(X'X)^{-1}$, where $D$ is as defined earlier and 
$\hat\sigma^2$ is the least squares estimator of $\sigma^2$ defined in (1.2.5). Equivalently, White 
considered elements of $X'DX - \hat\sigma^2X'X$. If we stack the elements of the upper 
triangular part of this matrix, we obtain a vector of $(K^2 + K)/2$ dimension 
defined by $S'(\hat u^2 - \hat\sigma^2l)$, where $\hat u^2$ is a $T$-vector the $t$th element of which is 
$\hat u_t^2$, $l$ is a $T$-vector of ones, and $S$ is a $T \times (K^2 + K)/2$ matrix, the columns of 
which are $(x_{i1}x_{j1}, x_{i2}x_{j2}, \ldots, x_{iT}x_{jT})'$ for $1 \leq i, j \leq K$, and $i \leq j$. It is easy 
to show that $T^{-1/2}S'(\hat u^2 - \hat\sigma^2l) \to N(0, A)$, where $A = \lim(T^{-1}S'\Lambda S + 
T^{-3}l'\Lambda l \cdot S'll'S - T^{-2}S'\Lambda l \cdot l'S - T^{-2}S'l \cdot l'\Lambda S)$ and $\Lambda = E(u^2 - \sigma^2l) 
(u^2 - \sigma^2l)'$. The test statistic proposed by White is 

$$
(\hat u^2 - \hat\sigma^2l)'S(T\hat A)^{-1}S'(\hat u^2 - \hat\sigma^2l),
$$

where $\hat A$ is obtained from $A$ by eliminating lim and replacing $\Lambda$ with 
$D\{(\hat u_t^2 - \hat\sigma^2)^2\}$. This statistic is asymptotically distributed as chi-square with 
$(K^2 + K)/2$ degrees of freedom under the null hypothesis. It can be simply 
computed as $TR^2$ from the regression of $\hat u_t^2$ on the products and cross products 
of $x_t$. 
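
The $TR^2$ form of the statistic is straightforward to compute. The sketch below assumes NumPy and assumes that the first column of $X$ is a vector of ones, so that the auxiliary regression contains an intercept and the centered $R^2$ is well defined. The function name and the simulated designs are hypothetical. The statistic is returned without a p-value; it would be compared with a chi-square critical value.

```python
import numpy as np

def white_statistic(X, y):
    """T * R^2 from regressing squared LS residuals on products and cross products of x_t."""
    T, K = X.shape
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    u2 = resid**2
    # Columns x_i * x_j for i <= j; with a constant in X the first column is a vector of ones.
    S = np.column_stack([X[:, i] * X[:, j] for i in range(K) for j in range(i, K)])
    fitted = S @ np.linalg.lstsq(S, u2, rcond=None)[0]
    r2 = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return T * r2

rng = np.random.default_rng(7)
T = 500
x = rng.uniform(0.0, 2.0, size=T)
X = np.column_stack([np.ones(T), x])
e = rng.normal(size=T)
stat_hom = white_statistic(X, X @ np.array([1.0, 1.0]) + e)       # constant variance
stat_het = white_statistic(X, X @ np.array([1.0, 1.0]) + x * e)   # variance proportional to x_t^2
```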


6.5.2 Constant Variance in a Subset of the Sample 


The heteroscedastic model we shall consider in this subsection represents the 
simplest way to restrict the number of estimable parameters to a finite and 
manageable size. We assume that the error vector $u$ is partitioned into $N$ 
nonoverlapping subvectors as $u = (u_1', u_2', \ldots, u_N')'$ such that $Eu_iu_i' = 
\sigma_i^2I_{T_i}$. We can estimate each $\sigma_i^2$ consistently from the least squares residuals 
provided that each $T_i$ goes to infinity with $T$. Note that this model is a special 
case of Zellner's SUR model discussed in Section 6.4, so that the asymptotic 
results given there also hold here. We shall consider a test of homoscedasticity 
and the exact moments of FGLS. 




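If we assume the normality of $\mathbf{u}$, the hypothesis $\sigma_i^2 = \sigma^2$ for all $i$ can be tested
by the likelihood ratio test in a straightforward manner. Partitioning $\mathbf{y}$ and $\mathbf{X}$
into $N$ subsets that conform to the partition of $\mathbf{u}$, we can write the constrained
and unconstrained log likelihood functions respectively as

$$\mathrm{CLL} = -\frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(\mathbf{y}_i - \mathbf{X}_i\boldsymbol{\beta})'(\mathbf{y}_i - \mathbf{X}_i\boldsymbol{\beta}) \tag{6.5.7}$$

and

$$\mathrm{ULL} = -\frac{1}{2}\sum_{i=1}^{N}T_i\log\sigma_i^2 - \frac{1}{2}\sum_{i=1}^{N}\frac{1}{\sigma_i^2}(\mathbf{y}_i - \mathbf{X}_i\boldsymbol{\beta})'(\mathbf{y}_i - \mathbf{X}_i\boldsymbol{\beta}). \tag{6.5.8}$$

Therefore $-2$ times the log likelihood ratio is given by

$$-2\log \mathrm{LRT} = \sum_{i=1}^{N}T_i\log(\tilde{\sigma}^2/\hat{\sigma}_i^2), \tag{6.5.9}$$

where $\tilde{\sigma}^2$ and $\hat{\sigma}_i^2$ are the constrained and unconstrained MLE, respectively.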
The statistic is asymptotically distributed as chi-square with N — 1 degrees of 
freedom.* 
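
A minimal sketch of (6.5.9) follows. It assumes normal errors, takes the constrained MLE of $\boldsymbol{\beta}$ to be LS, and obtains the unconstrained MLE by alternating between group variances and weighted LS; the function name `lr_groupwise`, the stopping rule, and the simulated data are illustrative assumptions rather than part of the text.

```python
import numpy as np

def lr_groupwise(ys, Xs, max_iter=500, tol=1e-10):
    """-2 log LR for H0: sigma_i^2 equal across the N groups (normal errors)."""
    y, X = np.concatenate(ys), np.vstack(Xs)
    T_i = np.array([len(yi) for yi in ys])
    # Constrained MLE: LS for beta and pooled residual variance with divisor T
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    s2_con = np.mean((y - X @ b) ** 2)
    # Unconstrained MLE: alternate group variances and weighted LS until stable
    for _ in range(max_iter):
        s2 = np.array([np.mean((yi - Xi @ b) ** 2) for yi, Xi in zip(ys, Xs)])
        w = np.repeat(1.0 / s2, T_i)
        b_new = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
        done = np.max(np.abs(b_new - b)) < tol
        b = b_new
        if done:
            break
    s2 = np.array([np.mean((yi - Xi @ b) ** 2) for yi, Xi in zip(ys, Xs)])
    return float(np.sum(T_i * np.log(s2_con / s2)))   # (6.5.9)

# Hypothetical two-group example with error standard deviations 1 and 3
rng = np.random.default_rng(1)
Xs = [np.column_stack([np.ones(200), rng.standard_normal(200)]) for _ in range(2)]
beta = np.array([1.0, 0.5])
ys_het = [Xs[0] @ beta + rng.standard_normal(200),
          Xs[1] @ beta + 3.0 * rng.standard_normal(200)]
lr_het = lr_groupwise(ys_het, Xs)
```

With $N = 2$ the statistic would be referred to the chi-square distribution with one degree of freedom.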

Taylor (1978) has considered a special case in which N = 2 in a model with 
normal errors and has derived the formulae for the moments of FGLS. By 
evaluating the covariance matrix of FGLS at various parameter values, Taylor 
has shown that FGLS is usually far more efficient than LS and is only slightly 
less efficient than GLS. 

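We shall sketch briefly the derivation of the moments of FGLS. Let $\mathbf{C}$ be a
$K \times K$ matrix such that

$$\mathbf{C}'\mathbf{X}_1'\mathbf{X}_1\mathbf{C} = \sigma_1^2\mathbf{I} \tag{6.5.10}$$

and

$$\mathbf{C}'\mathbf{X}_2'\mathbf{X}_2\mathbf{C} = \sigma_2^2\boldsymbol{\Lambda}, \tag{6.5.11}$$

where $\boldsymbol{\Lambda}$ is a diagonal matrix, the elements $\lambda_1, \lambda_2, \ldots, \lambda_K$ of which are the
roots of the equation

$$|\sigma_2^{-2}\mathbf{X}_2'\mathbf{X}_2 - \lambda\sigma_1^{-2}\mathbf{X}_1'\mathbf{X}_1| = 0. \tag{6.5.12}$$

The existence of such a matrix is guaranteed by Theorem 16 of Appendix 1.
With $\mathbf{S} = \mathbf{C}(\mathbf{I} + \boldsymbol{\Lambda})^{-1/2}$, transform the original equation $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$ to

$$\mathbf{y} = \mathbf{X}^*\boldsymbol{\gamma} + \mathbf{u}, \tag{6.5.13}$$

where $\mathbf{X}^* = \mathbf{X}\mathbf{S}$ and $\boldsymbol{\gamma} = \mathbf{S}^{-1}\boldsymbol{\beta}$. The FGLS estimator of $\boldsymbol{\gamma}$, denoted $\hat{\boldsymbol{\gamma}}$, is given by

$$\hat{\boldsymbol{\gamma}} = (\hat{\sigma}_1^{-2}\mathbf{X}_1^{*\prime}\mathbf{X}_1^* + \hat{\sigma}_2^{-2}\mathbf{X}_2^{*\prime}\mathbf{X}_2^*)^{-1}(\hat{\sigma}_1^{-2}\mathbf{X}_1^{*\prime}\mathbf{y}_1 + \hat{\sigma}_2^{-2}\mathbf{X}_2^{*\prime}\mathbf{y}_2), \tag{6.5.14}$$

where $\hat{\sigma}_i^2 = (T_i - K)^{-1}(\mathbf{y}_i - \mathbf{X}_i\hat{\boldsymbol{\beta}}_i)'(\mathbf{y}_i - \mathbf{X}_i\hat{\boldsymbol{\beta}}_i)$ and $\hat{\boldsymbol{\beta}}_i = (\mathbf{X}_i'\mathbf{X}_i)^{-1}\mathbf{X}_i'\mathbf{y}_i$.
Using $\sigma_1^{-2}\mathbf{X}_1^{*\prime}\mathbf{X}_1^* = (\mathbf{I} + \boldsymbol{\Lambda})^{-1}$ and $\sigma_2^{-2}\mathbf{X}_2^{*\prime}\mathbf{X}_2^* = \boldsymbol{\Lambda}(\mathbf{I} + \boldsymbol{\Lambda})^{-1}$, we obtain

$$\hat{\boldsymbol{\gamma}} - \boldsymbol{\gamma} = (\mathbf{I} + \boldsymbol{\Lambda})[\sigma_1^{-2}\mathbf{D}\mathbf{X}_1^{*\prime}\mathbf{u}_1 + \sigma_2^{-2}\boldsymbol{\Lambda}^{-1}(\mathbf{I} - \mathbf{D})\mathbf{X}_2^{*\prime}\mathbf{u}_2], \tag{6.5.15}$$

where $\mathbf{D}$ is a diagonal matrix the $i$th diagonal element of which is equal to
$\sigma_1^2\hat{\sigma}_2^2/(\sigma_1^2\hat{\sigma}_2^2 + \lambda_i\sigma_2^2\hat{\sigma}_1^2)$. Finally, the moments of $\hat{\boldsymbol{\gamma}} - \boldsymbol{\gamma}$ can be obtained by
making use of the independence of $\mathbf{D}$ with $\mathbf{X}_1^{*\prime}\mathbf{u}_1$ and $\mathbf{X}_2^{*\prime}\mathbf{u}_2$ and of known
formulae for the moments of $\mathbf{D}$ that involve a hypergeometric function.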

Kariya (1981) has derived the following inequalities concerning the covar- 
iance matrices of GLS and FGLS in a two-equation model: 


I 1 a 
vÊ, = VB; = [ *xr-x-5* xnl Vf. (6.5.16) 


6.5.3 General Parametric Heteroscedasticity 


In this subsection we shall assume $\sigma_t^2 = g_t(\boldsymbol{\alpha}, \boldsymbol{\beta}_1)$ without specifying $g_t$, where $\boldsymbol{\beta}_1$
is a subset (possibly whole) of the regression parameters $\boldsymbol{\beta}$ and $\boldsymbol{\alpha}$ is another
vector of parameters unrelated to $\boldsymbol{\beta}$. In applications it is often assumed that
$g_t(\boldsymbol{\alpha}, \boldsymbol{\beta}_1) = g(\boldsymbol{\alpha}, \mathbf{x}_t'\boldsymbol{\beta})$. The estimation of $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ can be done in several steps.
In the first step, we obtain the LS estimator of $\boldsymbol{\beta}$, denoted $\hat{\boldsymbol{\beta}}$. In the second step,
$\boldsymbol{\alpha}$ and $\boldsymbol{\beta}_1$ can be estimated by minimizing $\sum_{t=1}^{T}[\hat{u}_t^2 - g_t(\boldsymbol{\alpha}, \boldsymbol{\beta}_1)]^2$, where
$\hat{u}_t = y_t - \mathbf{x}_t'\hat{\boldsymbol{\beta}}$. The consistency and the asymptotic normality of the resulting
estimators, denoted $\tilde{\boldsymbol{\alpha}}$ and $\tilde{\boldsymbol{\beta}}_1$, have been proved by Jobson and Fuller (1980). In
the third step we have two main options: FGLS using $g_t(\tilde{\boldsymbol{\alpha}}, \tilde{\boldsymbol{\beta}}_1)$ or MLE under
normality using $\tilde{\boldsymbol{\alpha}}$ and $\hat{\boldsymbol{\beta}}$ as the initial estimates in some iterative algorithm.
Carroll and Ruppert (1982a) proved that under general assumptions FGLS
has the same asymptotic distribution as GLS. Jobson and Fuller derived
simple formulae for the method of scoring and proved that the estimator of $\boldsymbol{\beta}$
obtained at the second iteration is asymptotically efficient (and is asymptotically
more efficient than GLS or FGLS). Carroll and Ruppert (1982b) have
pointed out, however, that GLS or FGLS is more robust against a misspecification
in the $g$ function. Carroll and Ruppert (1982a) have proposed a robust
version of FGLS (cf. Section 2.3) in an attempt to make it robust also against
nonnormality.

There are situations where FGLS has the same asymptotic distribution as
MLE. Amemiya (1973b), whose major goal lay in another area, compared the
asymptotic efficiency of FGLS vis-à-vis MLE in cases where $y_t$ has mean $\mathbf{x}_t'\boldsymbol{\beta}$
and variance $\eta^2(\mathbf{x}_t'\boldsymbol{\beta})^2$ and follows one of the three distributions: (1) normal,
(2) lognormal, and (3) gamma. This is the form of heteroscedasticity suggested
by Prais and Houthakker (1955). It was shown that FGLS is asymptotically
efficient if $y_t$ has a gamma distribution. Thus this is an example of the BAN
estimators mentioned in Section 4.2.4.

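This last result is a special case of the following more general result attributable
to Nelder and Wedderburn (1972). Let $\{y_t\}$ be independently distributed
with the density

$$f(y_t) = \exp\{\alpha(\boldsymbol{\lambda})[g(\theta_t)y_t - h(\theta_t) + k(\boldsymbol{\lambda}, y_t)]\}, \tag{6.5.17}$$

where $\alpha$ and $\theta_t \equiv q(\mathbf{x}_t, \boldsymbol{\beta})$ are scalars and $\boldsymbol{\lambda}$, $\mathbf{x}_t$, and $\boldsymbol{\beta}$ are vectors. It is assumed
that $h'(\theta_t) = \theta_t g'(\theta_t)$ for every $t$, which implies that $Ey_t = \theta_t$ and $Vy_t =
[\alpha(\boldsymbol{\lambda})g'(\theta_t)]^{-1}$. Then, in the estimation of $\boldsymbol{\beta}$, the method of scoring is identical
to the Gauss-Newton nonlinear weighted least squares iteration (cf. Section
4.4.3). The binomial distribution is a special case of (6.5.17). To see this,
define the binary variable $y_t$ that takes unity with probability $\theta_t$ and zero with
probability $1 - \theta_t$ and put $\alpha(\lambda) = 1$, $g(\theta_t) = \log\theta_t - \log(1 - \theta_t)$, $h(\theta_t) =
-\log(1 - \theta_t)$, and $k(\lambda, y_t) = 0$. The normal distribution is also a special case:
Take $y_t \sim N(\theta_t, \lambda^2)$, $\alpha(\lambda) = \lambda^{-2}$, $g(\theta_t) = \theta_t$, $h(\theta_t) = \theta_t^2/2$, and $k(\lambda, y_t) =
-2^{-1}\lambda^2\log(2\pi\lambda^2) - y_t^2/2$. The special case of Amemiya (1973b) is obtained
by putting $\alpha(\lambda) = -\lambda$, $g(\theta_t) = \theta_t^{-1}$, $h(\theta_t) = -\log\theta_t$, and $k(\lambda, y_t) =
\lambda^{-1}\log\Gamma(\lambda) - \log\lambda + \lambda^{-1}(1 - \lambda)\log y_t$.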
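
The step from $h'(\theta_t) = \theta_t g'(\theta_t)$ to the stated mean and variance can be filled in as follows. The sketch assumes the usual regularity conditions that permit differentiating $\int f(y_t)\,dy_t = 1$ under the integral sign, and uses $h'' = g' + \theta_t g''$, which follows from the assumed condition.

```latex
\begin{align*}
0 &= E\{\alpha(\lambda)[g'(\theta_t)y_t - h'(\theta_t)]\}
  \;\Longrightarrow\; Ey_t = \frac{h'(\theta_t)}{g'(\theta_t)} = \theta_t, \\
0 &= E\{\alpha(\lambda)[g''(\theta_t)y_t - h''(\theta_t)]\}
   + E\{\alpha(\lambda)^2[g'(\theta_t)y_t - h'(\theta_t)]^2\} \\
  &= -\alpha(\lambda)g'(\theta_t) + \alpha(\lambda)^2 g'(\theta_t)^2\,Vy_t
  \;\Longrightarrow\; Vy_t = [\alpha(\lambda)g'(\theta_t)]^{-1}.
\end{align*}
```

As a check, in the binomial case $g'(\theta_t) = [\theta_t(1 - \theta_t)]^{-1}$, so that $Vy_t = \theta_t(1 - \theta_t)$; in the gamma case $\alpha(\lambda)g'(\theta_t) = \lambda\theta_t^{-2}$, so that $Vy_t = \theta_t^2/\lambda$, which is proportional to the square of the mean.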

There is a series of articles that discuss tests of homoscedasticity against
heteroscedasticity of the form $\sigma_t^2 = g(\alpha, \mathbf{x}_t'\boldsymbol{\beta})$, where $\alpha$ is a scalar such that
$g(0, \mathbf{x}_t'\boldsymbol{\beta})$ does not depend on $t$. Thus the null hypothesis is stated as $\alpha = 0$.
Anscombe (1961) was the first to propose such a test, which was later modified
by Bickel (1978). Hammerstrom (1981) proved that Bickel's test is locally
uniformly most powerful under normality.


6.5.4 Variance as a Linear Function of Regressors 


In this subsection we shall consider the model $\sigma_t^2 = \mathbf{z}_t'\boldsymbol{\alpha}$, where $\mathbf{z}_t$ is a $G$-vector
of known constants and $\boldsymbol{\alpha}$ is a $G$-vector of unknown parameters unrelated to
the regression coefficients $\boldsymbol{\beta}$. Such a parametric heteroscedastic regression
model has been frequently discussed in the literature. This is not surprising,
for this specification is as simple as a linear regression model and is more
general than it appears because the elements of $\mathbf{z}_t$ may be nonlinear transformations
of independent variables. The elements of $\mathbf{z}_t$ can also be related to $\mathbf{x}_t$
without affecting the results obtained in the following discussion. We shall
consider the estimation of $\boldsymbol{\alpha}$ and the test of the hypothesis $\sigma_t^2 = \sigma^2$ under this
model.

Hildreth and Houck (1968) and Goldfeld and Quandt (1972) were among
the first to study the estimation of $\boldsymbol{\alpha}$. We shall call these estimators HH and
GQ for short. We shall follow the discussion of Amemiya (1977b), who proposed
an estimator of $\boldsymbol{\alpha}$ asymptotically more efficient under normality than
the HH and GQ estimators.

Hildreth and Houck presented their model as a random coefficients model
(cf. Section 6.7) defined by

$$y_t = \mathbf{x}_t'(\boldsymbol{\beta} + \boldsymbol{\xi}_t), \tag{6.5.18}$$

where $\boldsymbol{\xi}_t$ are $K$-dimensional i.i.d. vectors with $E\boldsymbol{\xi}_t = \mathbf{0}$ and $E\boldsymbol{\xi}_t\boldsymbol{\xi}_t' = D\{\alpha_i\}$.
The last matrix denotes a $K \times K$ diagonal matrix, the $i$th diagonal element of which is
$\alpha_i$. Thus the Hildreth and Houck model is shown to be a special case of the
model where the variance is a linear function of regressors by putting $\mathbf{z}_t =
(x_{1t}^2, x_{2t}^2, \ldots, x_{Kt}^2)'$.

We shall compare the HH, GQ, and Amemiya estimators under the assumption
of the normality of $\mathbf{u}$. All three estimators are derived from a
regression model in which $\hat{u}_t^2$ serves as the dependent variable. Noting $\hat{u}_t =
u_t - \mathbf{x}_t'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}$, we can write

$$\hat{u}_t^2 = \mathbf{z}_t'\boldsymbol{\alpha} + v_{1t} - 2v_{2t} + v_{3t}, \tag{6.5.19}$$

where $v_{1t} = u_t^2 - \sigma_t^2$, $v_{2t} = u_t\mathbf{x}_t'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}$, and $v_{3t} = [\mathbf{x}_t'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}]^2$. We
can write (6.5.19) in vector notation as

$$\hat{\mathbf{u}}^2 = \mathbf{Z}\boldsymbol{\alpha} + \mathbf{v}_1 - 2\mathbf{v}_2 + \mathbf{v}_3. \tag{6.5.20}$$

We assume that $\mathbf{X}$ fulfills the assumptions of Theorem 3.5.4 and that $\lim_{T\to\infty}
T^{-1}\mathbf{Z}'\mathbf{Z}$ is a nonsingular finite matrix. (Amemiya, 1977b, has presented more
specific assumptions.)

Equation (6.5.20) is not strictly a regression equation because $E\mathbf{v}_2 \neq \mathbf{0}$ and
$E\mathbf{v}_3 \neq \mathbf{0}$. However, they can be ignored asymptotically because they are
$O(T^{-1/2})$ and $O(T^{-1})$, respectively. Therefore the asymptotic properties of LS,
GLS, and FGLS derived from (6.5.20) can be analyzed as if $\mathbf{v}_1$ were the only
error term. (Of course, this statement must be rigorously proved, as has been
done by Amemiya, 1977b.)

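The GQ estimator, denoted $\hat{\boldsymbol{\alpha}}_1$, is LS applied to (6.5.20). Thus

$$\hat{\boldsymbol{\alpha}}_1 = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\hat{\mathbf{u}}^2. \tag{6.5.21}$$

It can be shown that $\sqrt{T}(\hat{\boldsymbol{\alpha}}_1 - \boldsymbol{\alpha})$ has the same limit distribution as
$\sqrt{T}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{v}_1$. Consequently, $\hat{\boldsymbol{\alpha}}_1$ is consistent and asymptotically normal
with the asymptotic covariance matrix given by

$$\mathbf{V}_1 = 2(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{D}^2\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}, \tag{6.5.22}$$

where $\mathbf{D} \equiv E\mathbf{u}\mathbf{u}' = D\{\mathbf{z}_t'\boldsymbol{\alpha}\}$.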
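The HH estimator $\hat{\boldsymbol{\alpha}}_2$ is LS applied to a regression equation that defines
$E\hat{\mathbf{u}}^2$:

$$\hat{\mathbf{u}}^2 = \dot{\mathbf{M}}\mathbf{Z}\boldsymbol{\alpha} + \mathbf{w}, \tag{6.5.23}$$

where $\dot{\mathbf{M}}$ is obtained by squaring every element of $\mathbf{M} = \mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ and
$E\mathbf{w} = \mathbf{0}$. Thus

$$\hat{\boldsymbol{\alpha}}_2 = (\mathbf{Z}'\dot{\mathbf{M}}\dot{\mathbf{M}}\mathbf{Z})^{-1}\mathbf{Z}'\dot{\mathbf{M}}\hat{\mathbf{u}}^2. \tag{6.5.24}$$

It can be shown that

$$E\mathbf{w}\mathbf{w}' = 2\mathbf{Q}, \tag{6.5.25}$$

where $\mathbf{Q} = \mathbf{M}\mathbf{D}\mathbf{M} * \mathbf{M}\mathbf{D}\mathbf{M}$. The symbol $*$ denotes the Hadamard product
defined in Theorem 23 of Appendix 1. (Thus we could have written $\dot{\mathbf{M}}$ as
$\mathbf{M} * \mathbf{M}$.) Therefore the exact covariance matrix of $\hat{\boldsymbol{\alpha}}_2$ is given by

$$\mathbf{V}_2 = 2(\mathbf{Z}'\dot{\mathbf{M}}\dot{\mathbf{M}}\mathbf{Z})^{-1}\mathbf{Z}'\dot{\mathbf{M}}\mathbf{Q}\dot{\mathbf{M}}\mathbf{Z}(\mathbf{Z}'\dot{\mathbf{M}}\dot{\mathbf{M}}\mathbf{Z})^{-1}. \tag{6.5.26}$$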
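To compare the GQ and the HH estimators we shall consider a modification
of the HH estimator, called MHH, which is defined as

$$\hat{\boldsymbol{\alpha}}_2^* = (\mathbf{Z}'\dot{\mathbf{M}}\dot{\mathbf{M}}\mathbf{Z})^{-1}\mathbf{Z}'\dot{\mathbf{M}}\dot{\mathbf{M}}\hat{\mathbf{u}}^2. \tag{6.5.27}$$

Because $\mathbf{v}_2$ and $\mathbf{v}_3$ in (6.5.20) are of $O(T^{-1/2})$ and $O(T^{-1})$, respectively, this
estimator is consistent and asymptotically normal with the asymptotic covariance
matrix given by

$$\mathbf{V}_2^* = 2(\mathbf{Z}'\dot{\mathbf{M}}\dot{\mathbf{M}}\mathbf{Z})^{-1}\mathbf{Z}'\dot{\mathbf{M}}\dot{\mathbf{M}}\mathbf{D}^2\dot{\mathbf{M}}\dot{\mathbf{M}}\mathbf{Z}(\mathbf{Z}'\dot{\mathbf{M}}\dot{\mathbf{M}}\mathbf{Z})^{-1}. \tag{6.5.28}$$

Amemiya (1978a) showed that

$$\mathbf{Q} \ge \dot{\mathbf{M}}\mathbf{D}^2\dot{\mathbf{M}}, \tag{6.5.29}$$

or that $\mathbf{Q} - \dot{\mathbf{M}}\mathbf{D}^2\dot{\mathbf{M}}$ is nonnegative definite, by showing $\mathbf{A}'\mathbf{A} * \mathbf{A}'\mathbf{A} \ge
(\mathbf{A} * \mathbf{A})'(\mathbf{A} * \mathbf{A})$ for any square matrix $\mathbf{A}$ (take $\mathbf{A} = \mathbf{D}^{1/2}\mathbf{M}$). The latter inequality
follows from Theorem 23 (i) of Appendix 1. Therefore $\mathbf{V}_2 \ge \mathbf{V}_2^*$,
thereby implying that MHH is asymptotically more efficient than HH.
Now, from (6.5.27) we note that MHH is a "wrong" GLS applied to
(6.5.20), assuming $E\mathbf{v}_1\mathbf{v}_1'$ were $(\dot{\mathbf{M}}\dot{\mathbf{M}})^{-1}$ when, in fact, it is $2\mathbf{D}^2$. Therefore we
cannot make a definite comparison between GQ and MHH, nor between GQ
and HH. However, this consideration does suggest an estimator that is
asymptotically more efficient than either GQ or MHH, namely, FGLS applied
to Eq. (6.5.20),

$$\hat{\boldsymbol{\alpha}}_3 = (\mathbf{Z}'\hat{\mathbf{D}}_1^{-2}\mathbf{Z})^{-1}\mathbf{Z}'\hat{\mathbf{D}}_1^{-2}\hat{\mathbf{u}}^2, \tag{6.5.30}$$

where $\hat{\mathbf{D}}_1 = D\{\mathbf{z}_t'\hat{\boldsymbol{\alpha}}_1\}$. Amemiya (1977b) showed that the estimator is consistent
and asymptotically normal with the asymptotic covariance matrix
given by

$$\mathbf{V}_3 = 2(\mathbf{Z}'\mathbf{D}^{-2}\mathbf{Z})^{-1}. \tag{6.5.31}$$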


Amemiya (1977) has also shown that the estimator has the same asymptotic 
distribution as MLE.? 
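
The four estimators of $\boldsymbol{\alpha}$ can be computed side by side as in the sketch below. It is an illustration under stated assumptions, not code from any of the cited papers: the function name `alpha_estimators` and the simulated design are hypothetical, and the FGLS step assumes that no fitted variance $\mathbf{z}_t'\hat{\boldsymbol{\alpha}}_1$ is close to zero.

```python
import numpy as np

def alpha_estimators(y, X, Z):
    """GQ (6.5.21), HH (6.5.24), MHH (6.5.27) and FGLS (6.5.30) estimates of alpha."""
    T = len(y)
    M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)
    u2 = (M @ y) ** 2                        # squared LS residuals
    Mdot = M * M                             # Hadamard square of M
    a_gq = np.linalg.lstsq(Z, u2, rcond=None)[0]
    MZ = Mdot @ Z
    a_hh = np.linalg.solve(MZ.T @ MZ, MZ.T @ u2)
    a_mhh = np.linalg.solve(MZ.T @ MZ, MZ.T @ (Mdot @ u2))
    w = 1.0 / (Z @ a_gq) ** 2                # diagonal of D_1^{-2}
    a_fgls = np.linalg.solve(Z.T @ (Z * w[:, None]), Z.T @ (w * u2))
    return a_gq, a_hh, a_mhh, a_fgls

# Hypothetical design: variance 1 + 2 x_t^2 with normal errors
rng = np.random.default_rng(2)
T = 800
x = rng.uniform(0.0, 2.0, T)
X = np.column_stack([np.ones(T), x])
Z = np.column_stack([np.ones(T), x ** 2])
alpha_true = np.array([1.0, 2.0])
y = X @ np.array([1.0, 1.0]) + np.sqrt(Z @ alpha_true) * rng.standard_normal(T)
estimates = alpha_estimators(y, X, Z)
```

In a single sample of this size the four estimates differ only by sampling noise; the efficiency ranking discussed above concerns their asymptotic covariance matrices.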

It should be noted that all four estimators defined earlier are consistent even 
if u is not normal, provided that the fourth moment of u, is finite. All the 
formulae for the asymptotic covariance matrices in this subsection have been 
obtained under the normality assumption, and, therefore, the ranking of 
estimators is not preserved in the absence of normality. 

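So far we have concentrated on the estimation of $\boldsymbol{\alpha}$. Given any one of the
estimators of $\boldsymbol{\alpha}$ defined in this subsection, say $\hat{\boldsymbol{\alpha}}_3$, we can estimate $\boldsymbol{\beta}$ by FGLS

$$\hat{\boldsymbol{\beta}}_F = (\mathbf{X}'\hat{\mathbf{D}}_3^{-1}\mathbf{X})^{-1}\mathbf{X}'\hat{\mathbf{D}}_3^{-1}\mathbf{y}, \tag{6.5.32}$$

where $\hat{\mathbf{D}}_3 = D\{\mathbf{z}_t'\hat{\boldsymbol{\alpha}}_3\}$. We can iterate back and forth between (6.5.30) and (6.5.32). That is, in the
next round of the iteration, we can redefine $\hat{\mathbf{u}}$ as $\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_F$ and obtain a new
estimate $\hat{\boldsymbol{\alpha}}_3$, which in turn can be used to redefine $\hat{\mathbf{D}}_3$, and so on. Buse
(1982) showed that this iteration is equivalent under normality to the method
of scoring. Goldfeld and Quandt (1972) presented a Monte Carlo study of the
performance of LS, FGLS, and MLE in the estimation of $\boldsymbol{\beta}$ in the model where
$y_t = \beta_1 + \beta_2 x_t + u_t$ and $Vu_t = \alpha_1 + \alpha_2 x_t + \alpha_3 x_t^2$.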
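
The back-and-forth iteration between (6.5.30) and (6.5.32) can be sketched as follows. The function name `iterate_fgls`, the fixed number of rounds, and the small positive `floor` imposed on fitted variances are assumptions of this sketch (the text does not say how nonpositive fitted variances should be handled).

```python
import numpy as np

def iterate_fgls(y, X, Z, n_rounds=5, floor=1e-8):
    """Alternate between FGLS for alpha (6.5.30) and FGLS for beta (6.5.32)."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]          # LS start for beta
    u2 = (y - X @ b) ** 2
    a = np.linalg.lstsq(Z, u2, rcond=None)[0]         # GQ start for alpha, (6.5.21)
    for _ in range(n_rounds):
        w2 = 1.0 / np.maximum(Z @ a, floor) ** 2      # weights from the previous alpha
        a = np.linalg.solve(Z.T @ (Z * w2[:, None]), Z.T @ (w2 * u2))
        w1 = 1.0 / np.maximum(Z @ a, floor)           # inverse fitted variances
        b = np.linalg.solve(X.T @ (X * w1[:, None]), X.T @ (w1 * y))
        u2 = (y - X @ b) ** 2                         # redefine the residuals
    return b, a

# Hypothetical design: variance 1 + 2 x_t^2 with normal errors
rng = np.random.default_rng(3)
T = 800
x = rng.uniform(0.0, 2.0, T)
X = np.column_stack([np.ones(T), x])
Z = np.column_stack([np.ones(T), x ** 2])
beta_true = np.array([1.0, 1.0])
y = X @ beta_true + np.sqrt(Z @ np.array([1.0, 2.0])) * rng.standard_normal(T)
b_fgls, a_fgls = iterate_fgls(y, X, Z)
```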

We shall conclude this subsection with a brief account of tests of homoscedasticity
in the model where the variance is a linear function of regressors.
Assuming that the first column of $\mathbf{Z}$ is a vector of ones, the null hypothesis of
homoscedasticity can be set up as $H_0$: $\boldsymbol{\alpha} = (\sigma^2, 0, 0, \ldots, 0)'$ in this model.
Breusch and Pagan (1979) derived Rao's score test (cf. Section 4.5.1) of $H_0$ as

$$\mathrm{Rao} = \frac{1}{2\hat{\sigma}^4}(\hat{\mathbf{u}}^2 - \hat{\sigma}^2\mathbf{l})'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'(\hat{\mathbf{u}}^2 - \hat{\sigma}^2\mathbf{l}), \tag{6.5.33}$$

where $\mathbf{l}$ is the $T$-vector of ones. It is asymptotically distributed as $\chi_{G-1}^2$. Because
the asymptotic distribution may not be accurate in small samples, Breusch and
Pagan suggested estimating $P(\mathrm{Rao} > c)$ by simulation. They pointed out that
since Rao's score test depends only on $\hat{\mathbf{u}}$, the simulation is relatively simple.
For a simple way to make (6.5.33) robust to nonnormality, see Koenker
(1981b).
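
A direct transcription of (6.5.33) is sketched below. It assumes that $\hat{\sigma}^2$ is the average squared LS residual and that the first column of Z is a vector of ones; the function name `breusch_pagan` and the simulated data are hypothetical.

```python
import numpy as np

def breusch_pagan(y, X, Z):
    """Score statistic (6.5.33) and its degrees of freedom G - 1."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    u2 = (y - X @ b) ** 2
    s2 = u2.mean()                                   # sigma-hat squared (divisor T assumed)
    d = u2 - s2                                      # u-hat^2 minus sigma-hat^2 times ones
    proj = Z @ np.linalg.solve(Z.T @ Z, Z.T @ d)     # projection of d on the columns of Z
    return float(d @ proj / (2.0 * s2 ** 2)), Z.shape[1] - 1

# Hypothetical data: same shocks, with and without heteroscedasticity
rng = np.random.default_rng(4)
T = 500
x = rng.uniform(1.0, 3.0, T)
X = np.column_stack([np.ones(T), x])
Z = np.column_stack([np.ones(T), x ** 2])
e = rng.standard_normal(T)
rao_hom, df = breusch_pagan(X @ np.array([1.0, 2.0]) + e, X, Z)
rao_het, _ = breusch_pagan(X @ np.array([1.0, 2.0]) + x * e, X, Z)
```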

Goldfeld and Quandt (1965) proposed the following nonparametric test of
homoscedasticity, which they called the peak test: First, order the residuals $\hat{u}_t$
in the order of $\mathbf{z}_t'\hat{\boldsymbol{\alpha}}$ for a given estimator $\hat{\boldsymbol{\alpha}}$. This defines a sequence $\{\hat{u}_{(j)}\}$ where
$j \le k$ if and only if $\mathbf{z}_j'\hat{\boldsymbol{\alpha}} \le \mathbf{z}_k'\hat{\boldsymbol{\alpha}}$. (Instead of $\mathbf{z}_t'\hat{\boldsymbol{\alpha}}$, we can use any variable that we
suspect influences $\sigma_t^2$ most significantly.) Second, define that a peak occurs at
$j > 1$ if and only if $|\hat{u}_{(j)}| > |\hat{u}_{(k)}|$ for all $k < j$. Third, if the number of peaks exceeds
the critical value, reject homoscedasticity. Goldfeld and Quandt (1965, 1972)
presented tables of critical values.
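
The first two steps of the peak test amount to counting running maxima of the absolute residuals after sorting. A minimal sketch, with a hypothetical function name `count_peaks` and made-up residuals, is:

```python
import numpy as np

def count_peaks(resid, order_key):
    """Number of peaks in |resid| after ordering observations by order_key."""
    idx = np.argsort(np.asarray(order_key), kind="stable")
    a = np.abs(np.asarray(resid, dtype=float)[idx])
    peaks, running_max = 0, a[0]
    for v in a[1:]:
        if v > running_max:          # |u_(j)| exceeds every earlier |u_(k)|
            peaks += 1
            running_max = v
    return peaks

# Made-up residuals, already listed in the order of the sorting variable
n_peaks = count_peaks([1.0, -2.0, 0.5, 3.0, 2.9, -4.0], [0, 1, 2, 3, 4, 5])
```

Here peaks occur at the second, fourth, and sixth ordered residuals. The third step, comparison with a critical value, requires the published tables and is not reproduced.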


6.5.5 Variance as an Exponential Function of Regressors 


As we mentioned before, the linear specification, however simple, is more
general than it appears. However, a researcher may explicitly specify the
variance to be a certain nonlinear function of the regressors. The most natural
choice is an exponential function because, unlike a linear specification, it has
the attractive feature of ensuring positive variances. Harvey (1976), who assumed
$y_t \sim N[\mathbf{x}_t'\boldsymbol{\beta}, \exp(\mathbf{z}_t'\boldsymbol{\alpha})]$, proposed estimating $\boldsymbol{\alpha}$ by LS applied to the
regression of $\log\hat{u}_t^2$ on $\mathbf{z}_t$ and showed that the estimator is consistent if 1.2704
is added to the estimate of the constant term (that is, if the mean of $\log\chi_1^2$,
which is $-1.2704$, is subtracted from it). Furthermore, the estimator
has an asymptotic covariance matrix equal to $4.9348(\mathbf{Z}'\mathbf{Z})^{-1}$, more
than double the asymptotic covariance matrix of MLE, which is $2(\mathbf{Z}'\mathbf{Z})^{-1}$.
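
Harvey's two-step estimator can be sketched as follows. The sketch assumes that the first column of Z is the constant and uses the fact that $\log\chi_1^2$ has mean $-1.2704$ and variance $4.9348$; the function name `harvey_alpha` and the simulated design are hypothetical.

```python
import numpy as np

MEAN_LOG_CHI2_1 = -1.2704   # mean of log(chi-square with 1 d.f.)

def harvey_alpha(y, X, Z):
    """LS of log squared residuals on Z, with the intercept corrected."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    log_u2 = np.log((y - X @ b) ** 2)
    a = np.linalg.lstsq(Z, log_u2, rcond=None)[0]
    a[0] -= MEAN_LOG_CHI2_1      # equivalent to adding 1.2704 to the constant
    return a

# Hypothetical design: variance exp(0.5 + z_t)
rng = np.random.default_rng(5)
T = 20000
z = rng.standard_normal(T)
Z = np.column_stack([np.ones(T), z])
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
alpha_true = np.array([0.5, 1.0])
y = X @ np.array([1.0, -1.0]) + np.exp(0.5 * (Z @ alpha_true)) * rng.standard_normal(T)
alpha_hat = harvey_alpha(y, X, Z)
```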


6.6 Error Components Models 


Error components models are frequently used by econometricians in analyz- 
ing panel data — observations from a cross-section of the population (such as 
consumers, firms, and states, and hereafter referred to as individuals) at var- 
ious points in time. Depending on whether the error term consists of three 
components (a cross-section-specific component, a time-specific component, 
and a component that depends both on individuals and on time) or two 
components (excluding the time-specific component), these models are called 
three error components models (3ECM) and two error components models 
(2ECM), respectively. Two error components models are more frequently 
used in econometrics than 3ECM because it is usually more important to 
include a cross-section-specific component than a time-specific component.
For example, in studies explaining earnings, to which error components
models have often been applied, the cross-section-specific component may be 
regarded as the error term arising from permanent earnings and is more 
important than the time-specific component. 

We shall discuss first 3ECM and then various generalizations of 2ECM. 
These generalizations are (1) the Balestra-Nerlove model (2ECM with lagged 
endogenous variables), (2) 2ECM with a serially correlated error, and (3) 
2ECM with endogenous regressors. 


6.6.1 Three Error Components Models 
Three error components models are defined by

$$y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + u_{it}, \qquad i = 1, 2, \ldots, N, \quad t = 1, 2, \ldots, T, \tag{6.6.1}$$

and

$$u_{it} = \mu_i + \lambda_t + \epsilon_{it}, \tag{6.6.2}$$

where $\mu_i$ and $\lambda_t$ are the cross-section-specific and time-specific components
mentioned earlier. Assume that the sequences $\{\mu_i\}$, $\{\lambda_t\}$, and $\{\epsilon_{it}\}$ are i.i.d.
random variables with zero mean and are mutually independent with the
variances $\sigma_\mu^2$, $\sigma_\lambda^2$, and $\sigma_\epsilon^2$, respectively. In addition, assume that $\mathbf{x}_{it}$ is a $K$-vector
of known constants, the first element of which is 1 for all $i$ and $t$.

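We shall write (6.6.1) and (6.6.2) in matrix notation by defining several
symbols. Define $\mathbf{y}$, $\mathbf{u}$, $\boldsymbol{\epsilon}$, and $\mathbf{X}$ as matrices of size $NT \times 1$, $NT \times 1$, $NT \times 1$,
and $NT \times K$, respectively, such that their $[(i - 1)T + t]$th rows are $y_{it}$,
$u_{it}$, $\epsilon_{it}$, and $\mathbf{x}_{it}'$, respectively. Also define $\boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_N)'$,
$\boldsymbol{\lambda} = (\lambda_1, \lambda_2, \ldots, \lambda_T)'$, $\mathbf{L} = \mathbf{I}_N \otimes \mathbf{l}_T$, where $\mathbf{l}_T$ is a $T$-vector of ones, and $\mathbf{I} = \mathbf{l}_N \otimes \mathbf{I}_T$.
Then we can write (6.6.1) and (6.6.2) as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u} \tag{6.6.3}$$

and

$$\mathbf{u} = \mathbf{L}\boldsymbol{\mu} + \mathbf{I}\boldsymbol{\lambda} + \boldsymbol{\epsilon}. \tag{6.6.4}$$

The covariance matrix $\boldsymbol{\Omega} \equiv E\mathbf{u}\mathbf{u}'$ can be written as

$$\boldsymbol{\Omega} = \sigma_\mu^2\mathbf{A} + \sigma_\lambda^2\mathbf{B} + \sigma_\epsilon^2\mathbf{I}_{NT}, \tag{6.6.5}$$

where $\mathbf{A} = \mathbf{L}\mathbf{L}'$ and $\mathbf{B} = \mathbf{I}\mathbf{I}'$. Its inverse is given by

$$\boldsymbol{\Omega}^{-1} = \frac{1}{\sigma_\epsilon^2}(\mathbf{I}_{NT} - \gamma_1\mathbf{A} - \gamma_2\mathbf{B} + \gamma_3\mathbf{J}), \tag{6.6.6}$$

where $\gamma_1 = \sigma_\mu^2(\sigma_\epsilon^2 + T\sigma_\mu^2)^{-1}$,

$\gamma_2 = \sigma_\lambda^2(\sigma_\epsilon^2 + N\sigma_\lambda^2)^{-1}$,

$\gamma_3 = \gamma_1\gamma_2(2\sigma_\epsilon^2 + T\sigma_\mu^2 + N\sigma_\lambda^2)(\sigma_\epsilon^2 + T\sigma_\mu^2 + N\sigma_\lambda^2)^{-1}$,

and $\mathbf{J}$ is an $NT \times NT$ matrix consisting entirely of ones.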
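
The inverse formula (6.6.6) is easy to check numerically. The sketch below builds $\boldsymbol{\Omega}$ from Kronecker products, with observations ordered so that $t$ runs fastest within each $i$; the function names and the parameter values are hypothetical.

```python
import numpy as np

def omega_3ecm(N, T, s2_mu, s2_lam, s2_eps):
    """Covariance matrix (6.6.5) together with A = LL' and B = II'."""
    L = np.kron(np.eye(N), np.ones((T, 1)))        # NT x N
    I_lam = np.kron(np.ones((N, 1)), np.eye(T))    # NT x T, written I in the text
    A, B = L @ L.T, I_lam @ I_lam.T
    return s2_mu * A + s2_lam * B + s2_eps * np.eye(N * T), A, B

def omega_inv_3ecm(N, T, s2_mu, s2_lam, s2_eps):
    """Closed-form inverse (6.6.6)."""
    _, A, B = omega_3ecm(N, T, s2_mu, s2_lam, s2_eps)
    J = np.ones((N * T, N * T))
    g1 = s2_mu / (s2_eps + T * s2_mu)
    g2 = s2_lam / (s2_eps + N * s2_lam)
    g3 = (g1 * g2 * (2 * s2_eps + T * s2_mu + N * s2_lam)
          / (s2_eps + T * s2_mu + N * s2_lam))
    return (np.eye(N * T) - g1 * A - g2 * B + g3 * J) / s2_eps

N, T = 4, 3
Omega, _, _ = omega_3ecm(N, T, 2.0, 0.5, 1.5)
Omega_inv = omega_inv_3ecm(N, T, 2.0, 0.5, 1.5)
```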

In this model the LS estimator of $\boldsymbol{\beta}$ is unbiased and generally consistent if
both $N$ and $T$ go to $\infty$, but if $\boldsymbol{\Omega}$ is known, GLS provides a more efficient
estimator. Later we shall consider FGLS using certain estimates of the variances,
but first we shall show that we can obtain an estimator of $\boldsymbol{\beta}$ that has the
same asymptotic distribution as GLS (as both $N$ and $T$ go to $\infty$) but that does
not require the estimation of the variances.

To define this estimator, it is useful to separate the first element of $\boldsymbol{\beta}$ (the
intercept) from its remaining elements. We shall partition $\boldsymbol{\beta} = (\beta_0, \boldsymbol{\beta}_1')'$ and
$\mathbf{X} = (\mathbf{l}, \mathbf{X}_1)$. We shall call this estimator the transformation estimator because
it is based on the following transformation, which eliminates the cross-section-
and time-specific components from the errors. Define the $NT \times NT$
matrix

$$\mathbf{Q} = \mathbf{I} - \frac{1}{T}\mathbf{A} - \frac{1}{N}\mathbf{B} + \frac{1}{NT}\mathbf{J}. \tag{6.6.7}$$

It is easy to show that $\mathbf{Q}$ is a projection matrix of rank $NT - N - T + 1$ that is
orthogonal to $\mathbf{l}$, $\mathbf{L}$, and $\mathbf{I}$. Let $\mathbf{H}$ be an $NT \times (NT - N - T + 1)$ matrix, the
columns of which are the characteristic vectors of $\mathbf{Q}$ corresponding to the unit
roots. Premultiplying (6.6.3) by $\mathbf{H}'$ yields

$$\mathbf{H}'\mathbf{y} = \mathbf{H}'\mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{H}'\boldsymbol{\epsilon}, \tag{6.6.8}$$

which is Model 1 because $E\mathbf{H}'\boldsymbol{\epsilon}\boldsymbol{\epsilon}'\mathbf{H} = \sigma_\epsilon^2\mathbf{I}$. The transformation estimator of
$\boldsymbol{\beta}_1$, denoted $\hat{\boldsymbol{\beta}}_{Q1}$, is defined as LS applied to (6.6.8):

$$\hat{\boldsymbol{\beta}}_{Q1} = (\mathbf{X}_1'\mathbf{Q}\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{Q}\mathbf{y}. \tag{6.6.9}$$

The transformation estimator can be interpreted as the LS estimator of $\boldsymbol{\beta}_1$,
treating $\mu_1, \mu_2, \ldots, \mu_N$ and $\lambda_1, \lambda_2, \ldots, \lambda_T$ as if they were unknown regression
parameters. Then formula (6.6.9) is merely a special case of the
general formula for expressing a subset of the LS estimates given in (1.2.14).
This interpretation explains why the estimator is sometimes called the fixed-effects
estimator (since $\mu$'s and $\lambda$'s are treated as fixed effects rather than as
random variables) or the dummy-variable regression. Still another name for
the estimator is the covariance estimator.
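
The properties of $\mathbf{Q}$ and the estimator (6.6.9) can be illustrated with the following sketch. The function names and the noise-free data are hypothetical; with no $\epsilon_{it}$ term the estimator should reproduce $\boldsymbol{\beta}_1$ exactly, because $\mathbf{Q}$ annihilates the intercept and both sets of effects.

```python
import numpy as np

def within_matrix(N, T):
    """Q = I - A/T - B/N + J/(NT) of (6.6.7), with t running fastest within i."""
    PN = np.full((N, N), 1.0 / N)
    PT = np.full((T, T), 1.0 / T)
    return (np.eye(N * T) - np.kron(np.eye(N), PT)
            - np.kron(PN, np.eye(T)) + np.kron(PN, PT))

def transformation_estimator(y, X1, N, T):
    """Transformation (fixed-effects) estimator (6.6.9)."""
    Q = within_matrix(N, T)
    return np.linalg.solve(X1.T @ Q @ X1, X1.T @ Q @ y)

# Hypothetical noise-free panel: intercept, two regressors, both kinds of effects
rng = np.random.default_rng(6)
N, T = 5, 4
X1 = rng.standard_normal((N * T, 2))
beta1 = np.array([1.0, -2.0])
mu, lam = rng.standard_normal(N), rng.standard_normal(T)
y = 2.0 + X1 @ beta1 + np.repeat(mu, T) + np.tile(lam, N)
Q = within_matrix(N, T)
b_hat = transformation_estimator(y, X1, N, T)
```

Applying $\mathbf{Q}$ is the same as subtracting individual means and time means and adding back the overall mean, so in practice the $NT \times NT$ matrix need not be formed.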

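To compare (6.6.9) with the corresponding GLS estimator, we need to
derive the corresponding subset of GLS $\hat{\boldsymbol{\beta}}_G$. We have

$$\hat{\boldsymbol{\beta}}_{G1} = [\mathbf{X}_1'(\boldsymbol{\Omega}^{-1} - \boldsymbol{\Omega}^{-1}\mathbf{l}(\mathbf{l}'\boldsymbol{\Omega}^{-1}\mathbf{l})^{-1}\mathbf{l}'\boldsymbol{\Omega}^{-1})\mathbf{X}_1]^{-1}[\mathbf{X}_1'(\boldsymbol{\Omega}^{-1} - \boldsymbol{\Omega}^{-1}\mathbf{l}(\mathbf{l}'\boldsymbol{\Omega}^{-1}\mathbf{l})^{-1}\mathbf{l}'\boldsymbol{\Omega}^{-1})\mathbf{y}] \tag{6.6.10}$$
$$\qquad = [\mathbf{X}_1'(\mathbf{I} - \gamma_1\mathbf{A} - \gamma_2\mathbf{B} + \gamma_4\mathbf{J})\mathbf{X}_1]^{-1}\mathbf{X}_1'(\mathbf{I} - \gamma_1\mathbf{A} - \gamma_2\mathbf{B} + \gamma_4\mathbf{J})\mathbf{y},$$

where $\gamma_4 = (NT\sigma_\mu^2\sigma_\lambda^2 - \sigma_\epsilon^4)/NT(\sigma_\epsilon^2 + T\sigma_\mu^2)(\sigma_\epsilon^2 + N\sigma_\lambda^2)$. Note the similarity between
$\mathbf{Q}$ and $\mathbf{I} - \gamma_1\mathbf{A} - \gamma_2\mathbf{B} + \gamma_4\mathbf{J}$. The asymptotic equivalence between $\hat{\boldsymbol{\beta}}_{Q1}$
and $\hat{\boldsymbol{\beta}}_{G1}$ essentially follows from this similarity. If both $N$ and $T$ go to $\infty$ (it does
not matter how they go to $\infty$), it is straightforward to prove that under reasonable
assumptions on $\mathbf{X}$ and $\mathbf{u}$, $\sqrt{NT}(\hat{\boldsymbol{\beta}}_{Q1} - \boldsymbol{\beta}_1)$ and $\sqrt{NT}(\hat{\boldsymbol{\beta}}_{G1} - \boldsymbol{\beta}_1)$ converge to
$N[\mathbf{0}, \lim NT\sigma_\epsilon^2(\mathbf{X}_1'\mathbf{Q}\mathbf{X}_1)^{-1}]$. A proof of the special case of $\mathbf{l}'\mathbf{X}_1 = \mathbf{0}$ has been
given by Wallace and Hussain (1969).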

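The GLS estimator of $\beta_0$ is given by

$$\hat{\beta}_{G0} = \frac{\mathbf{l}'\mathbf{y} - \mathbf{l}'\mathbf{X}_1\hat{\boldsymbol{\beta}}_{G1}}{NT}. \tag{6.6.11}$$

Similarly, we can define the transformation estimator of $\beta_0$ by

$$\hat{\beta}_{Q0} = \frac{\mathbf{l}'\mathbf{y} - \mathbf{l}'\mathbf{X}_1\hat{\boldsymbol{\beta}}_{Q1}}{NT}. \tag{6.6.12}$$

We have

$$\hat{\beta}_{G0} - \beta_0 = \frac{\mathbf{l}'\mathbf{u} - \mathbf{l}'\mathbf{X}_1(\hat{\boldsymbol{\beta}}_{G1} - \boldsymbol{\beta}_1)}{NT} \tag{6.6.13}$$

and similarly for $\hat{\beta}_{Q0} - \beta_0$. Note that

$$\mathbf{l}'\mathbf{u} = T\sum_{i=1}^{N}\mu_i + N\sum_{t=1}^{T}\lambda_t + \sum_{i=1}^{N}\sum_{t=1}^{T}\epsilon_{it}, \tag{6.6.14}$$

where the probabilistic orders of the three terms in the right-hand side of
(6.6.14) are $T\sqrt{N}$, $N\sqrt{T}$, and $\sqrt{NT}$, respectively. Because the probabilistic
order of $\mathbf{l}'\mathbf{X}_1(\hat{\boldsymbol{\beta}}_{G1} - \boldsymbol{\beta}_1)$ or $\mathbf{l}'\mathbf{X}_1(\hat{\boldsymbol{\beta}}_{Q1} - \boldsymbol{\beta}_1)$ is $\sqrt{NT}$, it does not affect the asymptotic
distribution of $\hat{\beta}_{G0}$ or $\hat{\beta}_{Q0}$. Hence, these two estimators have the same
asymptotic distribution. To derive their asymptotic distribution, we must
specify whether $N$ or $T$ goes to $\infty$ faster. If $N$ grows faster than $T$, $\sqrt{T}(\hat{\beta}_{G0} - \beta_0)$
and $\sqrt{T}(\hat{\beta}_{Q0} - \beta_0)$ have the same limit distribution as $\sum_{t=1}^{T}\lambda_t/\sqrt{T}$, whereas if $T$
grows faster than $N$, $\sqrt{N}(\hat{\beta}_{G0} - \beta_0)$ and $\sqrt{N}(\hat{\beta}_{Q0} - \beta_0)$ have the same limit
distribution as $\sum_{i=1}^{N}\mu_i/\sqrt{N}$.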

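Because of the form of $\boldsymbol{\Omega}^{-1}$ given in (6.6.6), FGLS can be calculated if we
can estimate the three variances $\sigma_\epsilon^2$, $\sigma_\mu^2$, and $\sigma_\lambda^2$. Several estimators of the
variances have been suggested in the literature. (A Monte Carlo comparison of
several estimators of the variances and the resulting FGLS has been done by
Baltagi, 1981.) Amemiya (1971) proved the asymptotic normality of the following
so-called analysis-of-variance estimators of the variances:

$$\hat{\sigma}_\epsilon^2 = \frac{\hat{\mathbf{u}}'\mathbf{Q}\hat{\mathbf{u}}}{(N - 1)(T - 1)}, \tag{6.6.15}$$

$$\hat{\sigma}_\mu^2 = \frac{\hat{\mathbf{u}}'\left[\dfrac{T-1}{T}\mathbf{A} - \dfrac{T-1}{NT}\mathbf{J} - \mathbf{Q}\right]\hat{\mathbf{u}}}{T(N - 1)(T - 1)}, \tag{6.6.16}$$

and

$$\hat{\sigma}_\lambda^2 = \frac{\hat{\mathbf{u}}'\left[\dfrac{N-1}{N}\mathbf{B} - \dfrac{N-1}{NT}\mathbf{J} - \mathbf{Q}\right]\hat{\mathbf{u}}}{N(N - 1)(T - 1)}, \tag{6.6.17}$$

where $\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_Q$. Amemiya also proved that they are asymptotically more
efficient than the estimates obtained by using $\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ for $\hat{\mathbf{u}}$, where $\hat{\boldsymbol{\beta}}$ is the LS
estimator.

These estimates of the variances, or any other estimates with the respective
probabilistic order of $(NT)^{-1/2}$, $N^{-1/2}$, and $T^{-1/2}$, can be inserted into the
right-hand side of (6.6.6) for calculating FGLS. Fuller and Battese (1974)
proved that under general conditions FGLS and GLS have the same asymptotic
distribution.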
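
The quadratic forms in (6.6.15) to (6.6.17) can be written out as below. The helper names are hypothetical. As a consistency check the sketch evaluates $E\mathbf{u}'\mathbf{C}\mathbf{u} = \operatorname{tr}(\mathbf{C}\boldsymbol{\Omega})$ with the true errors in place of residuals, for which each form is exactly unbiased; with residuals $\hat{\mathbf{u}}$ the statements in the text are asymptotic.

```python
import numpy as np

def anova_forms(N, T):
    """Matrices C such that the three variance estimates are u' C u."""
    A = np.kron(np.eye(N), np.ones((T, T)))
    B = np.kron(np.ones((N, N)), np.eye(T))
    J = np.ones((N * T, N * T))
    Q = np.eye(N * T) - A / T - B / N + J / (N * T)
    d = (N - 1) * (T - 1)
    C_eps = Q / d                                                   # (6.6.15)
    C_mu = ((T - 1) / T * A - (T - 1) / (N * T) * J - Q) / (T * d)  # (6.6.16)
    C_lam = ((N - 1) / N * B - (N - 1) / (N * T) * J - Q) / (N * d) # (6.6.17)
    return C_eps, C_mu, C_lam

def anova_variances(u_hat, N, T):
    """Estimates of (sigma_eps^2, sigma_mu^2, sigma_lambda^2) from residuals."""
    return tuple(float(u_hat @ C @ u_hat) for C in anova_forms(N, T))

# Expected values of the three quadratic forms under hypothetical true variances
N, T = 4, 3
s2_eps, s2_mu, s2_lam = 1.5, 2.0, 0.5
A = np.kron(np.eye(N), np.ones((T, T)))
B = np.kron(np.ones((N, N)), np.eye(T))
Omega = s2_mu * A + s2_lam * B + s2_eps * np.eye(N * T)
expected = [float(np.trace(C @ Omega)) for C in anova_forms(N, T)]
```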


6.6.2 Two Error Components Model 


In 2ECM there is no time-specific error component. Thus the model is a
special case of 3ECM obtained by putting $\sigma_\lambda^2 = 0$. This model was first used in
econometric applications by Balestra and Nerlove (1966). However, because
in the Balestra-Nerlove model a lagged endogenous variable is included
among regressors, which causes certain additional statistical problems, we
shall discuss it separately in Section 6.6.3.
In matrix notation, 2ECM is defined by

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u} \tag{6.6.18}$$

and

$$\mathbf{u} = \mathbf{L}\boldsymbol{\mu} + \boldsymbol{\epsilon}. \tag{6.6.19}$$

The covariance matrix $\boldsymbol{\Omega} \equiv E\mathbf{u}\mathbf{u}'$ is given by

$$\boldsymbol{\Omega} = \sigma_\mu^2\mathbf{A} + \sigma_\epsilon^2\mathbf{I}_{NT}, \tag{6.6.20}$$

and its inverse by

$$\boldsymbol{\Omega}^{-1} = \frac{1}{\sigma_\epsilon^2}(\mathbf{I}_{NT} - \gamma_1\mathbf{A}), \tag{6.6.21}$$

where $\gamma_1 = \sigma_\mu^2/(\sigma_\epsilon^2 + T\sigma_\mu^2)$ as before.

We can define GLS and FGLS as in Section 6.6.1. We shall discuss the
estimation of $\gamma_1$ required for FGLS later. The asymptotic equivalence of GLS
and FGLS follows from Fuller and Battese (1974) because they allowed for the
possibility of $\sigma_\lambda^2 = 0$ in their proof. The transformation estimator of $\boldsymbol{\beta}$ can be
defined as in (6.6.9) and (6.6.12) except that here the transformation matrix $\mathbf{Q}$
is given by

$$\mathbf{Q} = \mathbf{I} - T^{-1}\mathbf{A}. \tag{6.6.22}$$

The asymptotic equivalence between GLS and the transformation estimator
can be similarly proved, except that in the 2ECM model $\sqrt{N}(\hat{\beta}_{G0} - \beta_0)$ and
$\sqrt{N}(\hat{\beta}_{Q0} - \beta_0)$ have the same limit distribution as $\sum_{i=1}^{N}\mu_i/\sqrt{N}$ regardless of the
way $N$ and $T$ go to $\infty$.

Following Maddala (1971), we can give an interesting interpretation of GLS
in comparison to the transformation estimator. Let $\mathbf{L}$ be as defined in Section
6.6.1 and let $\mathbf{F}$ be an $NT \times N(T - 1)$ matrix satisfying $\mathbf{F}'\mathbf{L} = \mathbf{0}$ and $\mathbf{F}'\mathbf{F} = \mathbf{I}$.
Then (6.6.18) can be equivalently written as the following two sets of regression
equations:

$$T^{-1/2}\mathbf{L}'\mathbf{y} = T^{-1/2}\mathbf{L}'\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\eta}_1 \tag{6.6.23}$$

and

$$\mathbf{F}'\mathbf{y} = \mathbf{F}'\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\eta}_2, \tag{6.6.24}$$

where $E\boldsymbol{\eta}_1\boldsymbol{\eta}_1' = (\sigma_\epsilon^2 + T\sigma_\mu^2)\mathbf{I}_N$, $E\boldsymbol{\eta}_2\boldsymbol{\eta}_2' = \sigma_\epsilon^2\mathbf{I}_{N(T-1)}$, and $E\boldsymbol{\eta}_1\boldsymbol{\eta}_2' = \mathbf{0}$. Maddala
calls (6.6.23) the between-group regression and (6.6.24) the within-group
regression. The transformation estimator $\hat{\boldsymbol{\beta}}_{Q1}$ can be interpreted as LS applied
to (6.6.24). (Note that since the premultiplication by $\mathbf{F}'$ eliminates the vector
of ones, $\beta_0$ cannot be estimated from this equation.) GLS $\hat{\boldsymbol{\beta}}_G$ can be interpreted
as GLS applied to (6.6.23) and (6.6.24) simultaneously. Because these equations
constitute the heteroscedastic regression model analyzed in Section
6.5.2, GLS has the following simple form:

$$\hat{\boldsymbol{\beta}}_G = (\mathbf{X}'\mathbf{P}\mathbf{X} + c\mathbf{X}'\mathbf{M}\mathbf{X})^{-1}(\mathbf{X}'\mathbf{P}\mathbf{y} + c\mathbf{X}'\mathbf{M}\mathbf{y}), \tag{6.6.25}$$

where $c = (\sigma_\epsilon^2 + T\sigma_\mu^2)/\sigma_\epsilon^2$, $\mathbf{P} = T^{-1}\mathbf{L}\mathbf{L}'$, and $\mathbf{M} = \mathbf{I} - \mathbf{P}$. To define FGLS, $c$
may be estimated as follows: Estimate $\sigma_\epsilon^2 + T\sigma_\mu^2$ by the LS estimator of the
variance obtained from regression (6.6.23), estimate $\sigma_\epsilon^2$ by the LS estimator of
the variance obtained from regression (6.6.24), and then take the ratio.

As we noted earlier, (6.6.23) and (6.6.24) constitute the heteroscedastic
regression model analyzed in Section 6.5.2. Therefore the finite-sample study
of Taylor (1978) applies to this model, but Taylor (1980) dealt with this model
specifically.
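
That (6.6.25) coincides with GLS based on (6.6.20) can be verified numerically, as in the following sketch. The function name `gls_2ecm`, the simulated panel, and the variance values are hypothetical, and the variances are treated as known.

```python
import numpy as np

def gls_2ecm(y, X, N, T, s2_mu, s2_eps):
    """GLS in the between/within form (6.6.25), observations ordered with t fastest."""
    P = np.kron(np.eye(N), np.full((T, T), 1.0 / T))   # P = LL'/T
    M = np.eye(N * T) - P
    c = (s2_eps + T * s2_mu) / s2_eps
    return np.linalg.solve(X.T @ P @ X + c * (X.T @ M @ X),
                           X.T @ P @ y + c * (X.T @ M @ y))

# Hypothetical panel
rng = np.random.default_rng(7)
N, T, s2_mu, s2_eps = 6, 4, 2.0, 1.0
X = np.column_stack([np.ones(N * T), rng.standard_normal((N * T, 2))])
u = np.repeat(np.sqrt(s2_mu) * rng.standard_normal(N), T) \
    + np.sqrt(s2_eps) * rng.standard_normal(N * T)
y = X @ np.array([1.0, 0.5, -0.5]) + u
b_split = gls_2ecm(y, X, N, T, s2_mu, s2_eps)

# Direct GLS with Omega = sigma_mu^2 A + sigma_eps^2 I for comparison
Omega = s2_mu * np.kron(np.eye(N), np.ones((T, T))) + s2_eps * np.eye(N * T)
Oi = np.linalg.inv(Omega)
b_direct = np.linalg.solve(X.T @ Oi @ X, X.T @ Oi @ y)
```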

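Next, following Balestra and Nerlove (1966), we derive MLE of the model
assuming the normality of $\mathbf{u}$. For this purpose it is convenient to adopt the
following re-parameterization used by Balestra and Nerlove: Define $\sigma^2 =
\sigma_\mu^2 + \sigma_\epsilon^2$, $\rho = \sigma_\mu^2/\sigma^2$, and $\mathbf{R} = (1 - \rho)\mathbf{I}_T + \rho\mathbf{l}_T\mathbf{l}_T'$. Then we have

$$\boldsymbol{\Omega} = \sigma^2(\mathbf{I}_N \otimes \mathbf{R}),$$
$$\boldsymbol{\Omega}^{-1} = \sigma^{-2}(\mathbf{I}_N \otimes \mathbf{R}^{-1}),$$
$$\mathbf{R}^{-1} = (1 - \rho)^{-1}\{\mathbf{I}_T - [\rho/(1 - \rho + \rho T)]\mathbf{l}_T\mathbf{l}_T'\},$$
$$\text{and} \quad |\mathbf{R}| = (1 - \rho)^T[1 + \rho T/(1 - \rho)].$$

Using these results, we can write the log likelihood function as

$$L = -\frac{1}{2}\log|\boldsymbol{\Omega}| - \frac{1}{2}\mathbf{u}'\boldsymbol{\Omega}^{-1}\mathbf{u} \tag{6.6.26}$$
$$\quad = -\frac{NT}{2}\log\sigma^2 - \frac{NT}{2}\log(1 - \rho) - \frac{N}{2}\log\left(1 + \frac{\rho T}{1 - \rho}\right) - \frac{1}{2\sigma^2}\mathbf{u}'(\mathbf{I} \otimes \mathbf{R}^{-1})\mathbf{u},$$

where we have written $\mathbf{u}$ for $\mathbf{y} - \mathbf{X}\boldsymbol{\beta}$. From (6.6.26) it is clear that the MLE of $\boldsymbol{\beta}$
given $\rho$ is the same as GLS and that the MLE of $\sigma^2$ must satisfy

$$\sigma^2 = \frac{\mathbf{u}'(\mathbf{I} \otimes \mathbf{R}^{-1})\mathbf{u}}{NT}. \tag{6.6.27}$$

Inserting (6.6.27) into (6.6.26) yields the concentrated log likelihood function

$$L^* = -\frac{NT}{2}\log\left\{\sum_{i=1}^{N}\left[\sum_{t=1}^{T}u_{it}^2 - \frac{\rho}{1 - \rho + \rho T}\left(\sum_{t=1}^{T}u_{it}\right)^2\right]\right\} - \frac{N}{2}\log\left(1 + \frac{\rho T}{1 - \rho}\right). \tag{6.6.28}$$

Putting $\partial L^*/\partial\rho = 0$ yields

$$\frac{\sum_{i=1}^{N}\left(\sum_{t=1}^{T}u_{it}\right)^2}{\sum_{i=1}^{N}\sum_{t=1}^{T}u_{it}^2} = 1 - \rho + \rho T, \tag{6.6.29}$$

from which we obtain

$$\rho = \frac{\sum_{i=1}^{N}\left(\sum_{t=1}^{T}u_{it}\right)^2 - \sum_{i=1}^{N}\sum_{t=1}^{T}u_{it}^2}{(T - 1)\sum_{i=1}^{N}\sum_{t=1}^{T}u_{it}^2}. \tag{6.6.30}$$

Also, using (6.6.29), we can simplify (6.6.27) as

$$\sigma^2 = \frac{\sum_{i=1}^{N}\sum_{t=1}^{T}u_{it}^2}{NT}. \tag{6.6.31}$$

The MLE of $\boldsymbol{\beta}$, $\rho$, and $\sigma^2$ can be obtained by simultaneously solving the
formula for GLS, (6.6.30), and (6.6.31).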
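
For given residuals, (6.6.30) and (6.6.31) are simple functions of two sums of squares, and the concentrated likelihood (6.6.28) can be used to confirm that the resulting $\rho$ is a maximizer. The sketch below treats the residuals as given (it does not iterate with GLS for $\boldsymbol{\beta}$); the function names and the simulated residuals are hypothetical.

```python
import numpy as np

def rho_sigma2(U):
    """(rho, sigma^2) from (6.6.30) and (6.6.31); U is the N x T array of u_it."""
    N, T = U.shape
    s1 = np.sum(U ** 2)                  # sum over i and t of u_it^2
    s2 = np.sum(U.sum(axis=1) ** 2)      # sum over i of (sum over t of u_it)^2
    return (s2 - s1) / ((T - 1) * s1), s1 / (N * T)

def concentrated_loglik(rho, U):
    """L* of (6.6.28), up to an additive constant."""
    N, T = U.shape
    s1 = np.sum(U ** 2)
    s2 = np.sum(U.sum(axis=1) ** 2)
    k = rho / (1.0 - rho + rho * T)
    return (-0.5 * N * T * np.log(s1 - k * s2)
            - 0.5 * N * np.log(1.0 + rho * T / (1.0 - rho)))

# Hypothetical residuals with an individual component
rng = np.random.default_rng(8)
N, T = 50, 5
U = rng.standard_normal((N, 1)) + rng.standard_normal((N, T))
rho_hat, sigma2_hat = rho_sigma2(U)
grid = np.linspace(-1.0 / (T - 1) + 0.01, 0.99, 400)
grid_values = np.array([concentrated_loglik(r, U) for r in grid])
```

The grid covers the admissible range $-1/(T - 1) < \rho < 1$, on which $\mathbf{R}$ is positive definite.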

Both Balestra and Nerlove (1966) and Maddala (1971) pointed out that the 
right-hand side of (6.6.30) can be negative. To ensure a positive estimate of p, 
Balestra and Nerlove suggested the following alternative formula for p: 


GG») - (zx) 


It is easy to show that the right-hand side of (6.6.32) is always positive.

Maddala (1971) showed that the $\rho$ given in (6.6.30) is less than 1. Berzeg
(1979) showed that if we allow for a nonzero covariance $\sigma_{\mu\epsilon}$ between $\mu_i$ and $\epsilon_{it}$,
the formulae for MLE are the same as those given in (6.6.30) and (6.6.31) by
redefining $\sigma^2 = \sigma_\mu^2 + 2\sigma_{\mu\epsilon} + \sigma_\epsilon^2$ and $\rho = (\sigma_\mu^2 + 2\sigma_{\mu\epsilon})/\sigma^2$ and in this model the
MLE of $\rho$ lies between 0 and 1.


(6.6.32)

One possible way to calculate FGLS consists of the following steps: (1)
Obtain the transformation estimator $\hat{\boldsymbol{\beta}}_Q$. (2) Define $\hat{\mathbf{u}}_Q = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_Q$. (3) Insert
$\hat{\mathbf{u}}_Q$ into the right-hand side of Eq. (6.6.32). (4) Use the resulting estimator of $\rho$
to construct FGLS. In the third step the numerator of (6.6.32) divided by
$N^2T^2$ can be interpreted as the sample variance of the LS estimator of $\mu_i + \beta_0$
obtained from the regression $\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{L}(\boldsymbol{\mu} + \beta_0\mathbf{l}) + \boldsymbol{\epsilon}$.


6.6.3 Balestra-Nerlove Model 


As we mentioned earlier, this is a generalization of 2ECM in the sense that a
lagged endogenous variable $y_{i,t-1}$ is included among the regressors. Balestra
and Nerlove (1966) used this model to analyze the demand for natural gas in
36 states in the period 1950-1962.

All the asymptotic results stated earlier for 2ECM hold also for the Balestra-Nerlove
model provided that both $N$ and $T$ go to $\infty$, as shown by Amemiya
(1967). However, there are certain additional statistical problems caused by
the presence of a lagged endogenous variable; we shall delineate these problems
in the following discussion.

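First, the LS estimator of $\boldsymbol{\beta}$ obtained from (6.6.18) is always unbiased and is
consistent if $N$ goes to $\infty$. However, if $\mathbf{x}_{it}$ contain $y_{i,t-1}$, LS is inconsistent even
when both $N$ and $T$ go to $\infty$. To see this, consider the simplest case

$$y_{it} = \beta y_{i,t-1} + \mu_i + \epsilon_{it}, \tag{6.6.33}$$

where we assume $|\beta| < 1$ for stationarity and $y_{i0} = 0$ for simplicity. Solving the
difference equation (6.6.33) and omitting $i$, we obtain

$$y_{t-1} = \frac{1 - \beta^{t-1}}{1 - \beta}\mu + \epsilon_{t-1} + \beta\epsilon_{t-2} + \cdots + \beta^{t-2}\epsilon_1. \tag{6.6.34}$$

Therefore, putting back the subscript $i$,

$$\operatorname*{plim}_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}y_{i,t-1}\mu_i = \frac{\mu_i^2}{1 - \beta}. \tag{6.6.35}$$

Therefore

$$\operatorname*{plim}_{N,T\to\infty}\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}y_{i,t-1}\mu_i = \frac{\sigma_\mu^2}{1 - \beta}, \tag{6.6.36}$$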

which implies the inconsistency of LS.

Second, we show that the transformation estimator $\hat{\boldsymbol{\beta}}_{Q1}$ is consistent if and
only if $T$ goes to $\infty$ (see note 9). We have

$$\hat{\boldsymbol{\beta}}_{Q1} - \boldsymbol{\beta}_1 = (\mathbf{X}_1'\mathbf{Q}\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{Q}\boldsymbol{\epsilon}, \tag{6.6.37}$$

where $\mathbf{Q}$ is given by (6.6.22). We need to consider only the column of $\mathbf{X}_1$
that corresponds to the lagged endogenous variable, denoted $\mathbf{y}_{-1}$, and only
the $T^{-1}\mathbf{A}$ part of $\mathbf{Q}$. Thus the consistency (as $T \to \infty$) of $\hat{\boldsymbol{\beta}}_{Q1}$ follows from

$$\frac{1}{NT^2}\mathbf{y}_{-1}'\mathbf{A}\boldsymbol{\epsilon} = \frac{1}{NT^2}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}y_{i,t-1}\right)\left(\sum_{t=1}^{T}\epsilon_{it}\right). \tag{6.6.38}$$

Third, if $y_{i0}$ (the value of $y_{i,t-1}$ at time $t = 1$) is assumed to be an unknown
parameter for each $i$, the MLE of $\boldsymbol{\beta}$ is inconsistent unless $T$ goes to $\infty$ (Anderson
and Hsiao, 1981, 1982). The problem of how to specify the initial values is
important in a panel data study, where typically $N$ is large but $T$ is not.

Fourth, the possibility of a negative MLE of $\rho$ is enhanced by the presence of
a lagged endogenous variable, as shown analytically by Maddala (1971) and
confirmed by a Monte Carlo study of Nerlove (1971). In his Monte Carlo
study, Nerlove compared various estimators of $\boldsymbol{\beta}$ and concluded that the
FGLS described at the end of Section 6.6.2 performs best. He found that the
transformation estimator of the coefficient on $y_{i,t-1}$ has a tendency toward
downward bias.
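
The first two points can be illustrated by a small simulation of model (6.6.33). The design below ($\beta = 0.5$, unit variances, $y_{i0} = 0$, $N = 2000$, $T = 10$) and the function names are hypothetical choices for illustration and are not taken from Nerlove's study.

```python
import numpy as np

def simulate_panel(N, T, beta, rng):
    """y_it = beta * y_{i,t-1} + mu_i + eps_it with y_i0 = 0; returns N x (T+1)."""
    mu = rng.standard_normal(N)
    eps = rng.standard_normal((N, T))
    y = np.zeros((N, T + 1))
    for t in range(1, T + 1):
        y[:, t] = beta * y[:, t - 1] + mu + eps[:, t - 1]
    return y

def ls_and_within(y):
    """Pooled LS and transformation (within) estimates of beta."""
    y_lag, y_cur = y[:, :-1], y[:, 1:]
    b_ls = np.sum(y_lag * y_cur) / np.sum(y_lag ** 2)
    # Q = I - A/T subtracts individual means from both variables
    dl = y_lag - y_lag.mean(axis=1, keepdims=True)
    dc = y_cur - y_cur.mean(axis=1, keepdims=True)
    b_within = np.sum(dl * dc) / np.sum(dl ** 2)
    return b_ls, b_within

rng = np.random.default_rng(9)
beta = 0.5
b_ls, b_within = ls_and_within(simulate_panel(2000, 10, beta, rng))
```

With $T$ fixed at 10, LS overstates $\beta$ because $y_{i,t-1}$ is positively correlated with $\mu_i$, whereas the transformation estimator understates it; only the latter bias vanishes as $T$ grows.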


6.6.4 Two Error Components Model with a Serially Correlated Error 


In this subsection we shall discuss the 2ECM defined by Eqs. (6.6.18) and
(6.6.19) in which $\boldsymbol{\epsilon}$ follows an AR(1) process, that is,

$$\epsilon_{it} = \gamma\epsilon_{i,t-1} + \xi_{it}, \tag{6.6.39}$$

where $\{\xi_{it}\}$ are i.i.d. with zero mean and variance $\sigma_\xi^2$. As in the Balestra-Nerlove
model, the specification of $\epsilon_{i0}$ will be important if $T$ is small. Lillard and
Willis (1978) used model (6.6.39) to explain the log of earnings by the independent
variables such as race, education, and labor force experience. They
assumed stationarity for $\{\epsilon_{it}\}$, which is equivalent to assuming $E\epsilon_{i0} = 0$,
$V\epsilon_{i0} = \sigma_\xi^2/(1 - \gamma^2)$, and the independence of $\epsilon_{i0}$ from $\xi_{i1}, \xi_{i2}, \ldots$. Thus, in
the Lillard-Willis model, $E\mathbf{u}\mathbf{u}' = (\sigma_\mu^2\mathbf{l}_T\mathbf{l}_T' + \boldsymbol{\Gamma}) \otimes \mathbf{I}_N$, where $\boldsymbol{\Gamma}$ is like (5.2.9)
with $\sigma^2 = \sigma_\xi^2$ and $\rho = \gamma$. Lillard and Willis estimated $\boldsymbol{\beta}$ and $\boldsymbol{\mu}$ by LS and $\sigma_\mu^2$, $\sigma_\xi^2$,
and $\gamma$ by inserting the LS residuals into the formulae for the normal MLE.
Anderson and Hsiao (1982) have presented the properties of the full MLE in
this model.
The model of Lillard and Weiss (1979) is essentially the same as model
(6.6.39) except that in the Lillard-Weiss model there is an additional error
component, so that $\mathbf{u} = \mathbf{L}\boldsymbol{\mu} + [\mathbf{I} - \mathbf{L}(\mathbf{L}'\mathbf{L})^{-1}\mathbf{L}'](\mathbf{I} \otimes \mathbf{f})\boldsymbol{\zeta} + \boldsymbol{\epsilon}$, where $\mathbf{f} =
(1, 2, \ldots, T)'$ and $\boldsymbol{\zeta}$ is independent of $\boldsymbol{\mu}$ and $\boldsymbol{\epsilon}$. The authors used LS, FGLS,
and MLE to estimate the parameters of their model (see note 10). The MLE was calculated
using the LISREL program developed by Jöreskog and Sörbom (1976).
Hause (1980) has presented a similar model.

Finally, MaCurdy (1982) generalized the Lillard-Willis model to a more
general time series process for $\epsilon_{it}$. He eliminated $\mu_i$ by first differencing and
treating $y_{it} - y_{i,t-1}$ as the dependent variable. Then he tried to model the LS
predictor for $\epsilon_{it} - \epsilon_{i,t-1}$ by a standard Box-Jenkins-type procedure. MaCurdy
argued that in a typical panel data model with small $T$ and large $N$ the
assumption of stationarity is unnecessary, and he assumed that the initial
values $\epsilon_{i0}, \epsilon_{i,-1}, \ldots$ are i.i.d. random variables across $i$ with zero mean and
unknown variances.


6.6.5 Two Error Components Model with Endogenous Regressors 

The two error components model with endogenous regressors is defined by

$$\begin{aligned}
y_1 &= X_1\beta + Z\gamma + Y_1\delta + \mu + \epsilon_1, \\
y_2 &= X_2\beta + Z\gamma + Y_2\delta + \mu + \epsilon_2, \\
&\;\;\vdots \\
y_T &= X_T\beta + Z\gamma + Y_T\delta + \mu + \epsilon_T,
\end{aligned} \tag{6.6.40}$$

where $y_t$, $t = 1, 2, \ldots, T$, is an $N$-vector, $X_t$ is an $N \times K$ matrix of known
constants, $Z$ is an $N \times G$ matrix of endogenous variables, $Y_t$ is an $N \times F$
matrix of endogenous variables, and $\mu$ and $\epsilon_t$ are $N$-vectors of random
variables with the same characteristics as in 2ECM. The variable $Z$ is always
assumed to be correlated with both $\mu$ and $\epsilon_t$, whereas $Y_t$ is sometimes assumed
to be correlated with both $\mu$ and $\epsilon_t$ and sometimes only with $\mu$. The two error
components model with endogenous regressors was analyzed by Chamberlain
and Griliches (1975) and Hausman and Taylor (1981). Chamberlain and
Griliches discussed maximum likelihood estimation assuming normality,
whereas Hausman and Taylor considered the application of instrumental
variables procedures.

Amemiya and MaCurdy (1983) proposed two instrumental variables
estimators: one is optimal if $Y_t$ is correlated with $\epsilon_t$ and the other is optimal
if $Y_t$ is uncorrelated with $\epsilon_t$. When we write model (6.6.40) simply as
$y = W\alpha + u$, the first estimator is defined by

$$\hat\alpha_1 = (W'\Omega^{-1/2}P_1\Omega^{-1/2}W)^{-1}W'\Omega^{-1/2}P_1\Omega^{-1/2}y, \tag{6.6.41}$$

where $P_1$ is the projection matrix onto the space spanned by the column
vectors of the $NT \times KT^2$ matrix $\Omega^{-1/2}(I_T \otimes S)$, where
$S = (X_1, X_2, \ldots, X_T)$. Amemiya and MaCurdy have shown that it is
asymptotically optimal among all the instrumental variables estimators if $Y_t$ is
correlated with $\epsilon_t$. The second estimator is defined by

$$\hat\alpha_2 = (W'\Omega^{-1/2}P_2\Omega^{-1/2}W)^{-1}W'\Omega^{-1/2}P_2\Omega^{-1/2}y, \tag{6.6.42}$$

where $P_2 = I - T^{-1}1_T1_T' \otimes [I_N - S(S'S)^{-1}S']$. It is asymptotically optimal
among all the instrumental variables estimators if $Y_t$ is uncorrelated with $\epsilon_t$.
The second estimator is a modification of the one proposed by Hausman and
Taylor (1981). In both of these estimators, $\Omega$ must be estimated. If a standard
consistent estimator is used, however, the asymptotic distribution is not
affected.


6.7 Random Coefficients Models 


Random coefficients models (RCM) are models in which the regression coef- 
ficients are random variables, the means, variances, and covariances of which 
are unknown parameters to estimate. The Hildreth and Houck model, which 
we discussed in Section 6.5.4, is a special case of RCM. The error components
models, which we discussed in Section 6.6, are also special cases of RCM. We 
shall discuss models for panel data in which the regression coefficients contain 
individual-specific and time-specific components that are independent across 
individuals and over time periods. We shall discuss in succession models 
proposed by Kelejian and Stephan (1983), Hsiao (1974, 1975), Swamy (1970), 
and Swamy and Mehta (1977). In the last subsection we shall mention several 
other related models, including so-called varying parameter regression models 
in which the time-specific component evolves with time according to some 
dynamic process. As RCM have not been applied as extensively as ECM, we 
shall devote less space to this section than to the last. 


6.7.1 The Kelejian and Stephan Model 


The RCM analyzed by Kelejian and Stephan (1983) is a slight generalization
of Hsiao's model, which we shall discuss in the next subsection. Their model is
defined by

$$y_{it} = x_{it}'\beta + x_{it}'(\mu_i + \lambda_t) + \epsilon_{it}, \tag{6.7.1}$$

$i = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T$. Note that we have separated out the
nonstochastic part $\beta$ and the random part $\mu_i + \lambda_t$ of the regression coefficients.
Using the symbols defined in Section 6.6.1 and two additional symbols, we
can write (6.7.1) in vector notation as

$$y = X\beta + \bar{X}\mu + X^*\lambda + \epsilon, \tag{6.7.2}$$

where we have defined $\bar{X} = \mathrm{diag}(X_1, X_2, \ldots, X_N)$ and
$X^* = (X_1^{*\prime}, X_2^{*\prime}, \ldots, X_N^{*\prime})'$, where $X_i^* = \mathrm{diag}(x_{i1}', x_{i2}', \ldots, x_{iT}')$.
It is assumed that $\mu$, $\lambda$, and $\epsilon$ have zero means, are uncorrelated with each
other, and have covariance matrices given by $E\mu\mu' = I_N \otimes \Sigma_\mu$,
$E\lambda\lambda' = I_T \otimes \Sigma_\lambda$, and $E\epsilon\epsilon' = \Sigma_\epsilon$, where $\Sigma_\mu$, $\Sigma_\lambda$, and $\Sigma_\epsilon$ are all
nonsingular.

Kelejian and Stephan were concerned only with the probabilistic order of
the GLS estimator of $\beta$, an important and interesting topic previously
overlooked in the literature. For this purpose we can assume that $\Sigma_\mu$, $\Sigma_\lambda$,
and $\Sigma_\epsilon$ are known. We shall discuss the estimation of these parameters in
Sections 6.7.2 and 6.7.3, where we shall consider models more specific than
model (6.7.1). In these models $\Sigma_\epsilon$ is specified to depend on a fixed finite
number of parameters: most typically, $\Sigma_\epsilon = \sigma^2 I_{NT}$.

The probabilistic order of $\hat\beta_G$ can be determined by deriving the order of the
inverse of its covariance matrix, denoted simply as $V$. We have

$$V^{-1} = X'[\bar{X}(I_N \otimes \Sigma_\mu)\bar{X}' + \Lambda]^{-1}X, \tag{6.7.3}$$

where $\Lambda = X^*(I_T \otimes \Sigma_\lambda)X^{*\prime} + \Sigma_\epsilon$. Using Theorem 20 of Appendix 1, we
obtain

$$[\bar{X}(I_N \otimes \Sigma_\mu)\bar{X}' + \Lambda]^{-1} = \Lambda^{-1} - \Lambda^{-1}\bar{X}[(I_N \otimes \Sigma_\mu^{-1}) + \bar{X}'\Lambda^{-1}\bar{X}]^{-1}\bar{X}'\Lambda^{-1}. \tag{6.7.4}$$

Therefore, noting $X = \bar{X}(1_N \otimes I_K)$ and defining $A = \bar{X}'\Lambda^{-1}\bar{X}$, we have

$$V^{-1} = (1_N' \otimes I_K)\{A - A[(I_N \otimes \Sigma_\mu^{-1}) + A]^{-1}A\}(1_N \otimes I_K). \tag{6.7.5}$$

Finally, using Theorem 19 (ii) of Appendix 1, we can simplify (6.7.5) as

$$V^{-1} = (1_N' \otimes I_K)[(I_N \otimes \Sigma_\mu) + A^{-1}]^{-1}(1_N \otimes I_K) \tag{6.7.6}$$

or as

$$V^{-1} = N\Sigma_\mu^{-1} - (1_N' \otimes \Sigma_\mu^{-1})[(I_N \otimes \Sigma_\mu^{-1}) + A]^{-1}(1_N \otimes \Sigma_\mu^{-1}). \tag{6.7.7}$$

Equation (6.7.7) is identical with Eq. (11) of Kelejian and Stephan (1983,
p. 252).

Now we can determine the order of $V^{-1}$. If we write the $i, j$th block
submatrix of $[(I_N \otimes \Sigma_\mu^{-1}) + A]^{-1}$, $i, j = 1, 2, \ldots, N$, as $G^{ij}$, the second
term of the right-hand side of (6.7.7) can be written as
$\Sigma_\mu^{-1}(\sum_{i=1}^N \sum_{j=1}^N G^{ij})\Sigma_\mu^{-1}$. Therefore the order of this term is $N^2/T$.
Therefore, if $T$ goes to $\infty$ at a rate equal to or faster than $N$, the order of
$V^{-1}$ is $N$. But, because our model is symmetric in $i$ and $t$, we can conclude
that if $N$ goes to $\infty$ at a rate equal to or faster than $T$, the order of $V^{-1}$ is $T$.
Combining the two, we can state that the order of $V^{-1}$ is $\min(N, T)$ or that
the probabilistic order of $\hat\beta_G$ is $\max(N^{-1/2}, T^{-1/2})$.
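
The algebra leading from (6.7.3) to (6.7.7) can be checked numerically. The sketch below builds the matrices of (6.7.2) for small, arbitrarily chosen $N$, $T$, and $K$ with randomly generated regressors, and compares the three expressions for $V^{-1}$. The dimensions, the random covariance matrices, and the choice $\Sigma_\epsilon = 0.7 I_{NT}$ are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, K = 3, 4, 2
x = rng.normal(size=(N, T, K))               # x[i, t] plays the role of x_it

def random_pd(k):
    """Arbitrary positive definite matrix (illustrative only)."""
    M = rng.normal(size=(k, k))
    return M @ M.T + k * np.eye(k)

Sigma_mu, Sigma_lam = random_pd(K), random_pd(K)
Sigma_eps = 0.7 * np.eye(N * T)              # assumed form sigma^2 * I_NT

X = x.reshape(N * T, K)                      # X_1, ..., X_N stacked
Xbar = np.zeros((N * T, N * K))              # diag(X_1, ..., X_N)
Xstar = np.zeros((N * T, T * K))             # (X_1*', ..., X_N*')'
for i in range(N):
    Xbar[i * T:(i + 1) * T, i * K:(i + 1) * K] = x[i]
    for t in range(T):
        Xstar[i * T + t, t * K:(t + 1) * K] = x[i, t]

ones = np.kron(np.ones((N, 1)), np.eye(K))   # 1_N (x) I_K
Lam = Xstar @ np.kron(np.eye(T), Sigma_lam) @ Xstar.T + Sigma_eps
A = Xbar.T @ np.linalg.solve(Lam, Xbar)

# (6.7.3): direct definition.
Omega = Xbar @ np.kron(np.eye(N), Sigma_mu) @ Xbar.T + Lam
V_inv_673 = X.T @ np.linalg.solve(Omega, X)

# (6.7.6)
V_inv_676 = ones.T @ np.linalg.inv(np.kron(np.eye(N), Sigma_mu) + np.linalg.inv(A)) @ ones

# (6.7.7)
Smi = np.linalg.inv(Sigma_mu)
inner = np.linalg.inv(np.kron(np.eye(N), Smi) + A)
V_inv_677 = N * Smi - np.kron(np.ones((1, N)), Smi) @ inner @ np.kron(np.ones((N, 1)), Smi)
```

Up to rounding error the three matrices should coincide, which is a useful check on the placement of the Kronecker products.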


6.7.2 Hsiao's Model 


Hsiao's model (1974, 1975) is obtained as a special case of the model of the
preceding subsection by assuming $\Sigma_\mu$ and $\Sigma_\lambda$ are diagonal and putting
$\Sigma_\epsilon = \sigma^2 I_{NT}$.

Hsiao (1975) proposed the following method of estimating $\Sigma_\mu$, $\Sigma_\lambda$, and $\sigma^2$:
For simplicity assume that $X$ does not contain a constant term. A simple
modification of the subsequent discussion necessary for the case in which $X$
contains the constant term is given in the appendix of Hsiao (1975). Consider
the time series equation for the $i$th individual:

$$y_i = X_i(\beta + \mu_i) + X_i^*\lambda + \epsilon_i. \tag{6.7.8}$$

If we treat $\mu_i$ as if it were a vector of unknown constants (which is permissible
so far as the estimation of $\Sigma_\lambda$ and $\sigma^2$ is concerned), model (6.7.8) is the
heteroscedastic regression model considered in Section 6.5.4. Hsiao suggested
estimating $\Sigma_\lambda$ and $\sigma^2$ either by the Hildreth-Houck estimator (6.5.24) or their
alternative estimator described in note 7 (this chapter). In this way we obtain
$N$ independent estimates of $\Sigma_\lambda$ and $\sigma^2$. Hsiao suggested averaging these $N$
estimates. (Of course, these estimates can be more efficiently combined, but
that may not be worth the effort.) By applying one of the Hildreth-Houck
estimators to the cross-section equations for $T$ time periods, $\Sigma_\mu$ can be
similarly estimated. (In the process we get another estimate of $\sigma^2$.)

Hsiao also discussed three methods of estimating $\beta$. The first method is
FGLS using the estimates of the variances described in the preceding
paragraph. The second method is an analog of the transformation estimator
defined for ECM in Section 6.6.1. It is defined as the LS estimator of $\beta$ obtained
from (6.7.2) treating $\mu$ and $\lambda$ as if they were unknown constants. The third
method is MLE derived under the normality of $y$, by which $\beta$, $\Sigma_\mu$, $\Sigma_\lambda$, and $\sigma^2$
are obtained simultaneously. Hsiao applied to his model the method of
scoring that Anderson (1969) derived for a very general random coefficients
model, where the covariance matrix of the error term can be expressed as a
linear combination of known matrices with unknown weights. (Note that
Anderson's model is so general that it encompasses all the models considered
in this chapter.) Hsiao essentially proved that the three estimators have the
same asymptotic distribution, although his proof is somewhat marred by his
assumption that these estimators are of the probabilistic order of $(NT)^{-1/2}$.


6.7.3 Swamy's Model


Swamy's model (1970) is a special case of the Kelejian-Stephan model
obtained by putting

$$\Sigma_\lambda = 0 \quad \text{and} \quad \Sigma_\epsilon = \Sigma \otimes I_T,$$

where $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_N^2)$. It is more restrictive than Hsiao's model
in the sense that there is no time-specific component in Swamy's model, but it
is more general in the sense that Swamy, unlike Hsiao, assumes neither the
diagonality of $\Sigma_\mu$ nor the homoscedasticity of $\epsilon$.

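Swamy proposed estimating $\Sigma_\mu$ and $\Sigma$ in the following steps:

Step 1. Estimate $\sigma_i^2$ by $\hat\sigma_i^2 = y_i'[I - X_i(X_i'X_i)^{-1}X_i']y_i/(T - K)$.

Step 2. Define $b_i = (X_i'X_i)^{-1}X_i'y_i$.

Step 3. Estimate $\Sigma_\mu$ by
$\hat\Sigma_\mu = (N - 1)^{-1}\sum_{i=1}^N (b_i - N^{-1}\sum_{j=1}^N b_j)(b_i - N^{-1}\sum_{j=1}^N b_j)' - N^{-1}\sum_{i=1}^N \hat\sigma_i^2(X_i'X_i)^{-1}$.

It is easy to show that $\hat\sigma_i^2$ and $\hat\Sigma_\mu$ are unbiased estimators of $\sigma_i^2$ and $\Sigma_\mu$,
respectively.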
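
The three steps above can be sketched as follows. This is a minimal illustration rather than Swamy's own implementation; the function name `swamy_variance_estimates` and the array layout (`y` of shape `(N, T)`, `X` of shape `(N, T, K)`) are assumptions made here. The usage example feeds in noise-free data, an artificial case in which each $b_i$ is recovered exactly and $\hat\Sigma_\mu$ reduces to the sample covariance matrix of the $b_i$.

```python
import numpy as np

def swamy_variance_estimates(y, X):
    """Steps 1-3: return (sigma2_hat, b, Sigma_mu_hat).

    y has shape (N, T); X has shape (N, T, K) (assumed layout).
    """
    N, T, K = X.shape
    b = np.empty((N, K))
    sigma2_hat = np.empty(N)
    correction = np.zeros((K, K))
    for i in range(N):
        XtX_inv = np.linalg.inv(X[i].T @ X[i])
        b[i] = XtX_inv @ X[i].T @ y[i]               # Step 2: individual LS
        resid = y[i] - X[i] @ b[i]
        sigma2_hat[i] = resid @ resid / (T - K)      # Step 1: y'[I - P]y / (T - K)
        correction += sigma2_hat[i] * XtX_inv
    dev = b - b.mean(axis=0)
    Sigma_mu_hat = dev.T @ dev / (N - 1) - correction / N   # Step 3
    return sigma2_hat, b, Sigma_mu_hat

# Artificial noise-free example (all values made up for illustration).
rng = np.random.default_rng(0)
N, T, K = 6, 10, 2
X = rng.normal(size=(N, T, K))
true_b = rng.normal(size=(N, K))
y = np.einsum("itk,ik->it", X, true_b)
sigma2_hat, b, Sigma_mu_hat = swamy_variance_estimates(y, X)
```

In finite samples with noisy data the subtraction in Step 3 can make $\hat\Sigma_\mu$ fail to be positive semidefinite, so a practical implementation would need to decide how to handle that case.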

Swamy proved that the FGLS estimator of $\beta$ using $\hat\sigma_i^2$ and $\hat\Sigma_\mu$ is
asymptotically normal with the order $N^{-1/2}$ and asymptotically efficient under
the normality assumption. Note that GLS is of the order of $N^{-1/2}$ in Swamy's
model because, using (6.7.7), we have in Swamy's model

$$V^{-1} = N\Sigma_\mu^{-1} - \Sigma_\mu^{-1}\sum_{i=1}^N [\Sigma_\mu^{-1} + \sigma_i^{-2}X_i'X_i]^{-1}\Sigma_\mu^{-1} = O(N) - O(N/T). \tag{6.7.9}$$


6.7.4 The Swamy-Mehta Model 


The Swamy-Mehta model (1977) is obtained from the Kelejian-Stephan
model by putting $\sigma^2 = 0$ but making the time-specific component more
general as $[\mathrm{diag}(X_1^*, X_2^*, \ldots, X_N^*)]\lambda^*$, where $E\lambda^* = 0$ and
$E\lambda^*\lambda^{*\prime} = \mathrm{diag}(I_T \otimes \Sigma_1, I_T \otimes \Sigma_2, \ldots, I_T \otimes \Sigma_N)$. In their model,
(6.7.7) is reduced to

$$V^{-1} = N\Sigma_\mu^{-1} - \Sigma_\mu^{-1}\sum_{i=1}^N \{\Sigma_\mu^{-1} + X_i'[X_i^*(I_T \otimes \Sigma_i)X_i^{*\prime}]^{-1}X_i\}^{-1}\Sigma_\mu^{-1} = O(N) - O(N/T), \tag{6.7.10}$$

as in Swamy's model.


6.7.5 Other Models 


In the preceding subsections we have discussed models in which cross-sec- 
tion-specific components are independent across individuals and time-spe- 
cific components are independent over time periods. We shall cite a few 
references for each of the other types of random coefficients models. They are 
classified into three types on the basis of the type of regression coefficients: (1) 
Regression coefficients are nonstochastic, and they either continuously or 
discretely change over cross-section units or time periods. (2) Regression 
coefficients follow a stochastic, dynamic process over time. (3) Regression 
coefficients are themselves dependent variables of separate regression models. 
Note that type 1 is strictly not a RCM, but we have mentioned it here because
of its similarity to RCM. Types 1 and 2 together constitute the varying
parameter regression models.

References for type 1 are Quandt (1958), Hinkley (1971), and Brown, 
Durbin, and Evans (1975). References for type 2 are Rosenberg (1973), 
Cooley and Prescott (1976), and Harvey (1978). A reference for type 3 is 
Amemiya (1978b). 


Exercises 


1. (Section 6.1.2)
Consider a classic regression model

$$y = \alpha x + \beta z + u,$$

where $\alpha$ and $\beta$ are scalar unknown parameters; $x$ and $z$ are $T$-component
vectors of known constants such that $x'1 = z'1 = 0$, where $1$ is a
$T$-component vector of ones; and $u$ is a $T$-component vector of unobservable
i.i.d. random variables with zero mean and unit variance. Suppose we are
given an estimator $\tilde\beta$ such that $E\tilde\beta = \beta$, $V\tilde\beta = T^{-1}$, and
$Eu\tilde\beta = T^{-1/2}\rho 1$, where $\rho$ is a known constant with $0 \le |\rho| < 1$. Write down
the expression for the best estimator of $\alpha$ you can think of. Justify your choice.


2. (Section 6.1.2)
In the model of Exercise 17 of Chapter 3, assume that the exact distribution
of $\hat\beta$ is known to be $N(\beta, T)$.

a. Obtain the mean squared error of $\hat\alpha$.

b. Find an estimator of $\alpha$ whose mean squared error is smaller than
that of $\hat\alpha$.


3. (Section 6.1.3)
Prove that statement D $\Rightarrow$ statement C in Theorem 6.1.1.


4. (Section 6.1.3)
If $K = 1$ in Model 6, the efficiency of LS relative to GLS can be defined by

$$\mathrm{Eff} = \frac{(x'x)^2}{(x'\Sigma^{-1}x)(x'\Sigma x)}.$$

Watson (1955) showed $\mathrm{Eff} \ge 4\lambda_l\lambda_s/(\lambda_l + \lambda_s)^2$, where $\lambda_l$ and $\lambda_s$ are the
largest and smallest characteristic roots of $\Sigma$, respectively. Evaluate this lower
bound for the case where $\Sigma$ is given by (5.2.9), using the approximation of
the characteristic roots by the spectral density (cf. Section 5.1.3).


5. (Section 6.1.3)
In Model 6 assume $K = 1$ and $X = 1$, a vector of ones. Also assume $\Sigma$ is
equal to $\Sigma_1$ given in (5.2.9). Calculate the limit of the efficiency of LS as
$T \to \infty$. (Efficiency is defined in Exercise 4.)


6. (Section 6.1.3)
Prove (6.1.6) directly, without using the fact that GLS is BLUE. 


7. (Section 6.1.5)
Consider a regression model

$$y = X\beta + u,$$

where $Eu = 0$ and $Euu' = P \equiv Z(Z'Z)^{-1}Z'$. We assume that $X$ and $Z$ are
$T \times K$ and $T \times G$ matrices of constants, respectively, such that
$\mathrm{rank}(X) = K$, $\mathrm{rank}(Z) = G < T$, and $PX = X$. Find a linear unbiased
estimator of $\beta$ the variance-covariance matrix of which is smaller than or
equal to that of any other linear unbiased estimator. Is such an estimator
unique?


8. (Section 6.2)
Suppose $y \sim N(X\beta, \Sigma)$, where there is no restriction on $\Sigma$ except that it is
positive definite. Can you obtain the MLE of $\Sigma$ by setting the derivative of
the log likelihood function with respect to $\Sigma$ equal to 0?


9. (Section 6.3.2)
Show that $A_1$ and $A_2$ given in (6.3.5) and (6.3.6) converge to 0 in
probability under the assumption of the text.


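10. (Section 6.3.2)
Combining $y_t = x_t'\beta + u_t$ and $u_t = \rho u_{t-1} + \epsilon_t$, we can write

$$y_t = \rho y_{t-1} + x_t'\beta - \rho x_{t-1}'\beta + \epsilon_t.$$

Durbin (1960) proposed estimating $\rho$ by the least squares coefficient on
$y_{t-1}$ in the regression of $y_t$ on $y_{t-1}$, $x_t$, and $x_{t-1}$. In vector notation

$$\hat\rho = \frac{y_{-1}'[I - Z(Z'Z)^{-1}Z']y}{y_{-1}'[I - Z(Z'Z)^{-1}Z']y_{-1}},$$

where $y = (y_1, y_2, \ldots, y_T)'$, $y_{-1} = (y_0, y_1, \ldots, y_{T-1})'$, $Z = (X, X_{-1})$,
$X = (x_1, x_2, \ldots, x_T)'$, and $X_{-1} = (x_0, x_1, \ldots, x_{T-1})'$. Show that $\hat\rho$ has
the same asymptotic distribution as $\sum_{t=1}^T u_{t-1}u_t / \sum_{t=1}^T u_{t-1}^2$, if
$\lim_{T\to\infty} T^{-1}Z'Z$ is a finite nonsingular matrix.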


11. (Section 6.3.3)
In Model 6 suppose $K = 1$ and $\{u_t\}$ follow AR(1), $u_t = \rho u_{t-1} + \epsilon_t$. If we
thought $\{u_t\}$ were i.i.d., we would estimate the variance of the LS estimator
$\hat\beta$ by $\hat V = \hat\sigma^2/x'x$, where $\hat\sigma^2 = T^{-1}[y'y - (x'x)^{-1}(y'x)^2]$. But the true
variance is $V = x'\Sigma x/(x'x)^2$, where $\Sigma$ is as given in (5.2.9). What can you say
about the sign of $\hat V - V$?


12. (Section 6.3.3)
Consider $y_t = \beta x_t + u_t$, $u_t = \rho u_{t-1} + \epsilon_t$, where $x_t$ is nonstochastic,
$|\rho| < 1$, $\{\epsilon_t\}$ are i.i.d. with $E\epsilon_t = 0$, $V\epsilon_t = \sigma^2$, $\{u_t\}$ are stationary, and
$\beta$, $\rho$, and $\sigma^2$ are unknown parameters. Given a sample $(y_t, x_t)$,
$t = 1, 2, \ldots, T$, and given $x_{T+1}$, what do you think is the best predictor of
$y_{T+1}$?


13. (Section 6.3.3)
Let $y_t = \beta t + u_t$, where $\{u_t\}$ follow AR(1), $u_t = \rho u_{t-1} + \epsilon_t$. Define the
following two predictors of $y_{T+1}$: $\hat y_{T+1} = (T + 1)\hat\beta$ and
$\tilde y_{T+1} = (T + 1)\tilde\beta$, where $\hat\beta$ and $\tilde\beta$ are the LS and GLS estimators of $\beta$
based on $y_1, y_2, \ldots, y_T$, respectively. In defining GLS, $\rho$ is assumed known.
Assuming $\rho > 0$, compare the two mean squared prediction errors.


14. (Section 6.3.5)
Suppose $Ey = X\beta$ and $Vy = \Sigma(\beta)$, meaning that the covariance matrix of
$y$ is a function of $\beta$. The true distribution of $y$ is unspecified. Show that
minimizing $(y - X\beta)'\Sigma(\beta)^{-1}(y - X\beta)$ with respect to $\beta$ yields an
inconsistent estimator, whereas minimizing
$(y - X\beta)'\Sigma(\beta)^{-1}(y - X\beta) + \log|\Sigma(\beta)|$ yields a consistent estimator. Note
that the latter estimator would be MLE if $y$ were normal.


15. (Section 6.3.6)
Show $\lim_{T\to\infty} (Ed_L - 2)^2 = 0$, where $d_L$ is defined in (6.3.24). Use that to
show $\mathrm{plim}_{T\to\infty}\, d_L = 2$.


16. (Section 6.3.6)
In the table of significance points given by Durbin and Watson (1951),
$d_U - d_L$ gets smaller as the sample size increases. Explain this
phenomenon.


17. (Section 6.3.7)
Verify (6.3.30).


18. (Section 6.3.7)
Prove (6.3.33).


19. (Section 6.3.7)
In the model (6.3.31), show $\mathrm{plim}\,\hat\rho = \alpha\rho(\alpha + \rho)/(1 + \alpha\rho)$ (cf. Malinvaud,
1961).


20. (Section 6.3.7)
Consider model (6.3.27). Define the $T$-vector
$x_{-1} = (x_0, x_1, \ldots, x_{T-1})'$ and the $T \times 2$ matrix $S = (x, x_{-1})$. Show that
the instrumental variables estimator $\hat\gamma \equiv (S'Z)^{-1}S'y$ is consistent and obtain
its asymptotic distribution. Show that the estimator of $\rho$ obtained by
using the residuals $y - Z\hat\gamma$ is consistent and obtain its asymptotic
distribution.


21. (Section 6.4)
In the SUR model show that FGLS and GLS have the same asymptotic
distribution as $T$ goes to infinity under appropriate assumptions on $u$ and
$X$. Specify the assumptions.


22. (Section 6.5.1)
Consider a regression model

$$y_t = x_t'\beta + u_t, \quad u_t = \rho u_{t-1} + \epsilon_t, \quad t = 1, 2, \ldots, T.$$

Assume

(A) $x_t$ is a vector of known constants such that $\sum_{t=1}^T x_t x_t'$ is a
nonsingular matrix.

(B) $u_0 = 0$.

(C) $\{\epsilon_t\}$ are independent with $E\epsilon_t = 0$ and $V\epsilon_t = \sigma_t^2$.

(D) $\{\sigma_t^2\}$ are known and bounded both from above and away from 0.

(E) $\beta$ and $\rho$ are unknown parameters.

Explain how you would estimate $\beta$. Justify your choice of the estimator.


23. (Section 6.5.3)
Show that the density of a multinomial model with more than two
responses or the density of a normal model where the variance is
proportional to the square of the mean cannot be written in the form of (6.5.17).


24. (Section 6.5.4)
Show that $\sqrt{T}(\hat\alpha_1 - \alpha)$, where $\hat\alpha_1$ is defined in (6.5.21), has the same limit
distribution as $\sqrt{T}(Z'Z)^{-1}Z'v_1$.


25. (Section 6.5.4)
Show (6.5.25).


26. (Section 6.5.4)
Consider a heteroscedastic nonlinear regression model

$$y_t = f_t(\beta_0) + u_t,$$

where $\{u_t\}$ are independent and distributed as $N(0, z_t'\alpha_0)$, $\beta_0$ is a scalar
unknown parameter, and $\alpha_0$ is a vector of unknown parameters.
Assume

(A) If $\hat\beta$ is the NLLS estimator of $\beta_0$, $\sqrt{T}(\hat\beta - \beta_0) \to N(0, c)$, where
$0 < c < \infty$.

(B) $\{z_t\}$ are vectors of known constants such that $|z_t| < h$ for some
finite vector $h$ (meaning that every element of $|z_t|$ is smaller than the
corresponding element of $h$), $z_t'\alpha_0$ is positive and bounded away from 0,
and $T^{-1}\sum_{t=1}^T z_t z_t'$ is nonsingular for every $T$ and converges to a finite
nonsingular matrix.

(C) $|\partial f_t/\partial\beta| < M$ for some finite $M$ for every $t$ and every $\beta$.

Generalize the Goldfeld-Quandt estimator of $\alpha_0$ to the nonlinear case and
derive its asymptotic distribution.


27. (Section 6.6.2)
Consider an error components model

$$y_{it} = x_{it}'\beta + \mu_i + \epsilon_{it}, \quad i = 1, 2, \ldots, N \text{ and } t = 1, 2, \ldots, T,$$

where $x_{it}$ is a $K$-vector of constants and $\mu_i$ and $\epsilon_{it}$ are scalar random
variables with zero means. Define $\mu = (\mu_1, \mu_2, \ldots, \mu_N)'$,
$\epsilon_i = (\epsilon_{i1}, \epsilon_{i2}, \ldots, \epsilon_{iT})'$, and $\epsilon = (\epsilon_1', \epsilon_2', \ldots, \epsilon_N')'$. Then we assume
$E\mu\mu' = \sigma_\mu^2 I_N$, $E\epsilon\epsilon' = I_N \otimes \Sigma$, and $E\mu\epsilon' = 0$, where $\Sigma$ is a
$T$-dimensional diagonal matrix the $t$th diagonal element of which is
$\sigma_t^2 > 0$. Assume $\sigma_\mu^2$ and $\Sigma$ are known. Obtain the formulae for the following
two estimators of $\beta$:
1. The generalized least squares estimator $\hat\beta_G$;
2. The fixed-effects estimator $\tilde\beta$ (it is defined here as the generalized
least squares estimator of $\beta$ treating $\mu_i$ as if they were unknown
parameters rather than random variables).


28. (Section 6.6.2)
Consider the model $y = X\beta + vz + u$, where
$y$: $T \times 1$, observable
$X$: $T \times K$, known constants
$z$: $T \times 1$, known constants
$\beta$: $K \times 1$, unknown constants
$v$: scalar, unobservable, $Ev = 0$, $Vv = \sigma^2$
$u$: $T \times 1$, unobservable, $Eu = 0$, $Euu' = I$.
Assume $v$ and $u$ are independent and $\sigma^2$ is a known constant. Also assume
$X'z \ne 0$ and $\mathrm{rank}[X, z] = K + 1$. Rank the following three estimators in
terms of the mean squared error matrix:

LS: $(X'X)^{-1}X'y$

GLS: $(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y$, where $\Sigma = E(vz + u)(vz + u)'$

QLS: $(X'JX)^{-1}X'Jy$, where $J = I - \dfrac{zz'}{z'z}$


7 Linear Simultaneous Equations Models 


In this chapter we shall give only the basic facts concerning the estimation of 
the parameters in linear simultaneous equations. A major purpose of the
chapter is to provide a basis for the discussion of nonlinear simultaneous 
equations to be given in the next chapter. Another purpose is to provide a 
rigorous derivation of the asymptotic properties of several commonly used 
estimators. For more detailed discussion of linear simultaneous equations, the 
reader is referred to textbooks by Christ (1966) and Malinvaud (1980). 


7.1 Model and Identification 
We can write the simultaneous equations model as

$$Y\Gamma = XB + U, \tag{7.1.1}$$

where $Y$ is a $T \times N$ matrix of observable random variables (endogenous
variables), $X$ is a $T \times K$ matrix of known constants (exogenous variables), $U$ is
a $T \times N$ matrix of unobservable random variables, and $\Gamma$ and $B$ are $N \times N$ and
$K \times N$ matrices of unknown parameters. We denote the $t, i$th element of $Y$ by
$y_{ti}$, the $i$th column of $Y$ by $y_i$, and the $t$th row of $Y$ by $y_{(t)}'$, and similarly
for $X$ and $U$. This notation is consistent with that of Chapter 1.

As an example of the simultaneous equations model, consider the following
demand and supply equations:

Demand: $p_t = \gamma_1 q_t + x_{t1}'\beta_1 + u_{t1}$.
Supply: $q_t = \gamma_2 p_t + x_{t2}'\beta_2 + u_{t2}$.


The demand equation specifies the price the consumer is willing to pay for 
given values of the quantity and the independent variables plus the error term, 
and the supply equation specifies the quantity the producer is willing to supply 
for given values of the price and the independent variables plus the error term. 
The observed price and quantity are assumed to be the equilibrium values that 
satisfy both equations. This is the classic explanation of how a simultaneous 
equations model arises. For an interesting alternative explanation in which
the simultaneous equations model is regarded as the limit of a multivariate
time series model as the length of the time lag goes to 0, see articles by Strotz 
(1960) and Fisher (1970). 

We impose the following assumptions: 


ASSUMPTION 7.1.1. The sequence of $N$-vectors $\{u_{(t)}\}$ is i.i.d. with zero mean
and an unknown covariance matrix $\Sigma$. (Thus $EU = 0$ and $ET^{-1}U'U = \Sigma$.)
We do not assume normality of $\{u_{(t)}\}$, although some estimators considered in
this chapter are obtained by maximizing a normal density.


ASSUMPTION 7.1.2. Rank of $X$ is $K$, and $\lim T^{-1}X'X$ exists and is
nonsingular.


ASSUMPTION 7.1.3. $\Gamma$ is nonsingular.


Solving (7.1.1) for $Y$, we obtain

$$Y = X\Pi + V, \tag{7.1.2}$$

where

$$\Pi = B\Gamma^{-1} \tag{7.1.3}$$

and $V = U\Gamma^{-1}$. We define $\Lambda \equiv \Gamma'^{-1}\Sigma\Gamma^{-1}$. We shall call (7.1.2) the reduced
form equations, in contrast to (7.1.1), which are called the structural
equations.

We assume that the diagonal elements of $\Gamma$ are ones. This is merely a
normalization and involves no loss of generality. In addition, we assume that
certain elements of $\Gamma$ and $B$ are zeros.¹ Let $-\gamma_i$ be the column vector consisting
of those elements of the $i$th column of $\Gamma$ that are specified to be neither 1 nor 0,
and let $\beta_i$ be the column vector consisting of those elements of the $i$th column
of $B$ that are not specified to be 0. Also, let $Y_i$ and $X_i$ be the subsets of the
columns of $Y$ and $X$ that are postmultiplied by $-\gamma_i$ and $\beta_i$, respectively. Then
we can write the $i$th structural equation as

$$y_i = Y_i\gamma_i + X_i\beta_i + u_i \equiv Z_i\alpha_i + u_i. \tag{7.1.4}$$

We denote the number of columns of $Y_i$ and $X_i$ by $N_i$ and $K_i$, respectively.
Combining $N$ such equations, we can write (7.1.1) alternatively as

$$y = Z\alpha + u, \tag{7.1.5}$$

where $y = (y_1', y_2', \ldots, y_N')'$, $\alpha = (\alpha_1', \alpha_2', \ldots, \alpha_N')'$,
$u = (u_1', u_2', \ldots, u_N')'$, and $Z = \mathrm{diag}(Z_1, Z_2, \ldots, Z_N)$.

We define $\Omega \equiv Euu' = \Sigma \otimes I_T$. Note that (7.1.5) is analogous to the
multivariate regression model (6.4.2), except that $Z$ in (7.1.5) includes
endogenous variables.

We now ask, Is $\alpha_i$ identified? The precise definition of identification differs
among authors and can be very complicated. In this book we shall take a
simple approach and use the word synonymously with “existence of a
consistent estimator.”² Thus our question is, Is there a consistent estimator of
$\alpha_i$? Because there is a consistent estimator of $\Pi$ under our assumptions (for
example, the least squares estimator $\hat\Pi = (X'X)^{-1}X'Y$ will do), our question
can be paraphrased as, Does (7.1.3) uniquely determine $\alpha_i$ when $\Pi$ is
determined?

To answer this question, we write that part of (7.1.3) that involves $\gamma_i$ and
$\beta_i$ as

$$\pi_{i1} - \Pi_{i1}\gamma_i = \beta_i \tag{7.1.6}$$

and

$$\pi_{i0} - \Pi_{i0}\gamma_i = 0. \tag{7.1.7}$$

Here, $(\pi_{i1}', \pi_{i0}')'$ is the $i$th column of $\Pi$, and $(\Pi_{i1}', \Pi_{i0}')'$ is the subset of the
columns of $\Pi$ that are postmultiplied by $\gamma_i$. The second subscript 0 or 1
indicates the rows corresponding to the zero or nonzero elements of the $i$th
column of $B$. Note that $\Pi_{i0}$ is a $K_{(i)} \times N_i$ matrix, where $K_{(i)} = K - K_i$. From
(7.1.7) it is clear that $\gamma_i$ is uniquely determined if and only if

$$\mathrm{rank}(\Pi_{i0}) = N_i. \tag{7.1.8}$$

This is called the rank condition of identifiability. It is clear from (7.1.6) that
once $\gamma_i$ is uniquely determined, $\beta_i$ is uniquely determined. For (7.1.8) to hold,
it is necessary to assume

$$K_{(i)} \ge N_i, \tag{7.1.9}$$

which means that the number of excluded exogenous variables is greater than
or equal to the number of included endogenous variables. The condition
(7.1.9) is called the order condition of identifiability.³

[Figure 7.1 Demand and supply curves. Axis label shown: p. Curves shown: one supply curve, and demand curves with different values of the independent variables.]


If (7.1.8) does not hold, we say $\alpha_i$ is not identified or is underidentified. If
(7.1.8) holds, and moreover, if $K_{(i)} = N_i$, we say $\alpha_i$ is exactly identified or is
just-identified. If (7.1.8) holds and $K_{(i)} > N_i$, we say $\alpha_i$ is overidentified.
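
The rank and order conditions can be checked mechanically once $\Pi$ is available. The following is a minimal sketch, not a standard library routine: the function `identification_status`, its arguments, and the numerical values are hypothetical. It assumes the equation has at least one endogenous variable on its right-hand side, and it uses the demand and supply model of this section with the supply equation containing no exogenous variables.

```python
import numpy as np

def identification_status(Pi, rhs_endog_cols, excluded_exog_rows):
    """Classify one structural equation from the reduced-form matrix Pi (K x N).

    rhs_endog_cols: columns of Y appearing on the right-hand side (N_i of them, N_i >= 1).
    excluded_exog_rows: rows of X excluded from the equation (K_(i) of them).
    """
    N_i = len(rhs_endog_cols)
    K_excl = len(excluded_exog_rows)
    if K_excl < N_i:                                   # order condition (7.1.9) fails
        return "underidentified"
    Pi_i0 = Pi[np.ix_(excluded_exog_rows, rhs_endog_cols)]
    if np.linalg.matrix_rank(Pi_i0) < N_i:             # rank condition (7.1.8) fails
        return "underidentified"
    return "exactly identified" if K_excl == N_i else "overidentified"

# Demand and supply with Y = (p, q); made-up parameter values.
gamma1, gamma2 = -0.5, 0.8
Gamma = np.array([[1.0, -gamma2],
                  [-gamma1, 1.0]])      # columns: demand equation, supply equation
B = np.array([[1.0, 0.0],
              [2.0, 0.0]])              # two demand shifters, none in supply
Pi = B @ np.linalg.inv(Gamma)

supply = identification_status(Pi, rhs_endog_cols=[0], excluded_exog_rows=[0, 1])
demand = identification_status(Pi, rhs_endog_cols=[1], excluded_exog_rows=[])
```

In practice $\Pi$ is unknown and must be replaced by an estimate, in which case a numerical rank computed this way is only suggestive.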

If $\beta_1 \ne 0$ and $\beta_2 = 0$ in the demand and supply model given in the beginning
of this section, $\gamma_2$ is identified but $\gamma_1$ is not. This fact is illustrated in Figure 7.1,
where the equilibrium values of the quantity and the price will be scattered
along the supply curve as the demand curve shifts with the values of the
independent variables. Under the same assumption on the $\beta$'s, we have

$$\Pi_1 = \frac{1}{1 - \gamma_1\gamma_2}\beta_1 \quad \text{and} \quad \Pi_2 = \frac{\gamma_2}{1 - \gamma_1\gamma_2}\beta_1, \tag{7.1.10}$$

where $\Pi_1$ and $\Pi_2$ are the coefficients on $x_1$ in the reduced form equations for $p$
and $q$, respectively. From (7.1.10) it is clear that if $\beta_1$ consists of a single
element, $\gamma_2$ is exactly identified, whereas if $\beta_1$ is a vector of more than one
element, $\gamma_2$ is overidentified.
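
Equation (7.1.10) can be verified by computing $\Pi = B\Gamma^{-1}$ directly for the demand and supply model with $\beta_2 = 0$. The sketch below uses made-up parameter values; with two elements in $\beta_1$, each element of $\Pi_2$ divided by the corresponding element of $\Pi_1$ gives $\gamma_2$, which is one way to see why $\gamma_2$ is overidentified in that case.

```python
import numpy as np

gamma1, gamma2 = -0.5, 0.8            # hypothetical slopes
beta1 = np.array([1.0, 2.0])          # two demand shifters
Gamma = np.array([[1.0, -gamma2],
                  [-gamma1, 1.0]])    # columns: demand equation, supply equation
B = np.column_stack([beta1, np.zeros_like(beta1)])   # beta2 = 0

Pi = B @ np.linalg.inv(Gamma)         # reduced-form coefficients, (7.1.3)
Pi1, Pi2 = Pi[:, 0], Pi[:, 1]         # coefficients on x1 in the p and q equations
denom = 1.0 - gamma1 * gamma2

gamma2_implied = Pi2 / Pi1            # one ratio per element of beta1
```

With population values the ratios agree exactly; with an estimated $\Pi$ they generally differ, and the estimators discussed later in the chapter can be viewed as different ways of reconciling them.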


7.2 Full Information Maximum Likelihood Estimator 


In this section we shall define the maximum likelihood estimator of the 
parameters of model (7.1.1) obtained by assuming the normality of U, and we 
shall derive its asymptotic properties without assuming normality. We attach 
the term full information (FIML) to distinguish it from the limited information
maximum likelihood (LIML) estimator, which we shall discuss later.
The logarithmic likelihood function under the normality assumption is
given by

$$\log L = -\frac{NT}{2}\log(2\pi) + T\log\|\Gamma\| - \frac{T}{2}\log|\Sigma| - \frac{1}{2}\,\mathrm{tr}\,\Sigma^{-1}(Y\Gamma - XB)'(Y\Gamma - XB), \tag{7.2.1}$$

where $\|\Gamma\|$ denotes the absolute value of the determinant of $\Gamma$. We define the
FIML estimators as the values of $\Sigma$, $\Gamma$, and $B$ that maximize $\log L$ subject to
the normalization on $\Gamma$ and the zero restrictions on $\Gamma$ and $B$. We shall derive its
properties without assuming normality; once we have defined FIML, we shall
treat it simply as an extremum estimator that maximizes the function given in
(7.2.1). We shall prove the consistency and asymptotic normality of FIML
under Assumptions 7.1.1, 7.1.2, and 7.1.3; in addition we shall assume the
identifiability condition (7.1.8) and $T \ge N + K$.

Differentiating $\log L$ with respect to $\Sigma$ using the rules of matrix
differentiation given in Theorem 21 of Appendix 1 and equating the derivative
to 0, we obtain

$$\Sigma = T^{-1}(Y\Gamma - XB)'(Y\Gamma - XB). \tag{7.2.2}$$

Inserting (7.2.2) into (7.2.1) yields the concentrated log likelihood function

$$\log L^* = -\frac{T}{2}\log|(Y - XB\Gamma^{-1})'(Y - XB\Gamma^{-1})|, \tag{7.2.3}$$

where we have omitted terms that do not depend on the unknown parameters.
The condition $T \ge N + K$ is needed because without it the determinant can be
made equal to 0 for some choice of $B$ and $\Gamma$. Thus the FIML estimators of the
unspecified elements of $B$ and $\Gamma$ can be defined as those values that minimize

$$S_T = |T^{-1}(Y - XB\Gamma^{-1})'(Y - XB\Gamma^{-1})|. \tag{7.2.4}$$
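
The concentration step can be checked numerically: evaluating (7.2.1) at $\Sigma$ given by (7.2.2) should equal $-\tfrac{NT}{2}[\log(2\pi) + 1] - \tfrac{T}{2}\log S_T$, so that maximizing the likelihood over $\Gamma$ and $B$ is the same as minimizing $S_T$. The sketch below uses arbitrary simulated matrices; the function names and the data are illustrative assumptions only.

```python
import numpy as np

def log_likelihood(Y, X, Gamma, B, Sigma):
    """Normal log likelihood (7.2.1)."""
    T, N = Y.shape
    E = Y @ Gamma - X @ B
    _, logabsdet_Gamma = np.linalg.slogdet(Gamma)     # log of |det Gamma|
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    quad = np.trace(np.linalg.solve(Sigma, E.T @ E))
    return (-0.5 * N * T * np.log(2 * np.pi) + T * logabsdet_Gamma
            - 0.5 * T * logdet_Sigma - 0.5 * quad)

def S_T(Y, X, Gamma, B):
    """Determinant criterion (7.2.4)."""
    T = Y.shape[0]
    W = Y - X @ B @ np.linalg.inv(Gamma)
    return np.linalg.det(W.T @ W / T)

# Arbitrary illustrative data and parameter values.
rng = np.random.default_rng(0)
T, N, K = 50, 2, 3
X = rng.normal(size=(T, K))
Y = rng.normal(size=(T, N))
Gamma = np.array([[1.0, -0.5], [0.4, 1.0]])
B = rng.normal(size=(K, N))

E = Y @ Gamma - X @ B
Sigma_hat = E.T @ E / T                               # (7.2.2)
lhs = log_likelihood(Y, X, Gamma, B, Sigma_hat)
rhs = -0.5 * N * T * (np.log(2 * np.pi) + 1.0) - 0.5 * T * np.log(S_T(Y, X, Gamma, B))
```

The constant $-NT/2$ comes from the trace term, which equals $NT$ once $\Sigma$ is replaced by (7.2.2).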


Inserting $Y = XB_0\Gamma_0^{-1} + V$, where $B_0$ and $\Gamma_0$ denote the true values, into
(7.2.4) and taking the probability limit, we obtain

$$\mathrm{plim}\, S_T = |\Lambda_0 + (B_0\Gamma_0^{-1} - B\Gamma^{-1})'A(B_0\Gamma_0^{-1} - B\Gamma^{-1})|, \tag{7.2.5}$$

where $A = \lim T^{-1}X'X$. Moreover, the convergence is clearly uniform in $B$
and $\Gamma$ in the neighborhood of $B_0$ and $\Gamma_0$ in the sense defined in Chapter 4.
Inasmuch as $\mathrm{plim}\, S_T$ is minimized uniquely at $B = B_0$ and $\Gamma = \Gamma_0$ because of
the identifiability condition, the FIML estimators of the unspecified elements
of $\Gamma$ and $B$ are consistent by Theorem 4.1.2. It follows from (7.2.2) that the
FIML estimator of $\Sigma$ is also consistent.

The asymptotic normality of the FIML estimators of T and B can be proved 
by using Theorem 4.1.3, which gives the conditions for the asymptotic nor- 
mality of a general extremum estimator. However, here we shall prove it by 
representing FIML in the form of an instrumental variables estimator. The 
instrumental variables interpretation of FIML is originally attributable to 
Durbin (1963) and is discussed by Hausman (1975). 

Differentiating (7.2.1) with respect to the unspecified elements of $B$ (cf.
Theorem 21, Appendix 1) and equating the derivatives to 0, we obtain

$$X'(Y\Gamma - XB)\Sigma^{-1} \doteq 0, \tag{7.2.6}$$

where $\doteq$ means that only those elements of the left-hand side of (7.2.6) that
correspond to the unspecified elements of $B$ are set equal to 0. The $i$th column
of the left-hand side of (7.2.6) is $X'(Y\Gamma - XB)\sigma^i$, where $\sigma^i$ is the $i$th column of
$\Sigma^{-1}$. Note that this is the derivative of $\log L$ with respect to the $i$th column of
$B$. But, because only the $K_i$ elements of the $i$th column of $B$ are nonzero,
(7.2.6) is equivalent to

$$X_i'(Y\Gamma - XB)\sigma^i = 0, \quad i = 1, 2, \ldots, N. \tag{7.2.7}$$

We can combine the $N$ equations of (7.2.7) as

$$\begin{bmatrix} X_1' & 0 & \cdots & 0 \\ 0 & X_2' & & \vdots \\ \vdots & & \ddots & \\ 0 & \cdots & & X_N' \end{bmatrix} (\Sigma^{-1} \otimes I) \begin{bmatrix} y_1 - Z_1\alpha_1 \\ y_2 - Z_2\alpha_2 \\ \vdots \\ y_N - Z_N\alpha_N \end{bmatrix} = 0. \tag{7.2.8}$$

Differentiating (7.2.1) with respect to the unspecified elements of $\Gamma$ and
equating the derivatives to 0, we obtain

$$T(\Gamma')^{-1} - Y'(Y\Gamma - XB)\Sigma^{-1} \doteq 0, \tag{7.2.9}$$

where $\doteq$ means that only those elements of the left-hand side of (7.2.9) that
correspond to the unspecified elements of $\Gamma$ are set equal to 0. Solving (7.2.2)
for $TI$ and inserting it into (7.2.9) yield

$$\bar{Y}'(Y\Gamma - XB)\Sigma^{-1} \doteq 0, \tag{7.2.10}$$

where $\bar{Y} = XB\Gamma^{-1}$. In exactly the same way that we rewrote (7.2.6) as (7.2.8),
we can rewrite (7.2.10) as

$$\begin{bmatrix} \bar{Y}_1' & 0 & \cdots & 0 \\ 0 & \bar{Y}_2' & & \vdots \\ \vdots & & \ddots & \\ 0 & \cdots & & \bar{Y}_N' \end{bmatrix} (\Sigma^{-1} \otimes I) \begin{bmatrix} y_1 - Z_1\alpha_1 \\ y_2 - Z_2\alpha_2 \\ \vdots \\ y_N - Z_N\alpha_N \end{bmatrix} = 0, \tag{7.2.11}$$

where $\bar{Y}_i$ consists of the nonstochastic part of those vectors of $Y$ that appear in
the right-hand side of the $i$th structural equation.

Defining $\bar{Z}_i = (\bar{Y}_i, X_i)$ and $\bar{Z} = \mathrm{diag}(\bar{Z}_1, \bar{Z}_2, \ldots, \bar{Z}_N)$, we can combine
(7.2.8) and (7.2.11) into a single equation

$$\alpha = [\bar{Z}'(\Sigma^{-1} \otimes I)Z]^{-1}\bar{Z}'(\Sigma^{-1} \otimes I)y. \tag{7.2.12}$$


The FIML estimator of $\alpha$, denoted $\hat\alpha$, is a solution of (7.2.12), where $\Sigma$ is
replaced by the right-hand side of (7.2.2). Because both $\bar{Z}$ and $\Sigma$ in the
right-hand side of (7.2.12) depend on $\alpha$, (7.2.12) defines $\hat\alpha$ implicitly. Nevertheless,
this representation is useful for deriving the asymptotic distribution of FIML
as well as for comparing it with the 3SLS estimator (see Section 7.4).

Equation (7.2.12), with $\Sigma$ replaced by the right-hand side of (7.2.2), can be
used as an iterative algorithm for calculating $\hat\alpha$. Evaluate $\bar{Z}$ and $\Sigma$ in the
right-hand side of (7.2.12) using an initial estimate of $\alpha$, thus obtaining a new
estimate of $\alpha$ by (7.2.12). Insert the new estimate into the right-hand side, and
so on. However, Chow (1968) found a similar algorithm inferior to the more
standard Newton-Raphson iteration (cf. Section 4.4.1).
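
The iteration just described can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the algorithm studied by Chow (1968): the function `fiml_step`, the `spec` layout, the simulated data, and the starting value are all hypothetical. The example system has two equations, each excluding exactly one exogenous variable, so both are exactly identified. In that special case the nonstochastic part $\bar{Z}_i$ spans the same space as $X$, and each pass of (7.2.12) should reproduce the equation-by-equation instrumental variables estimator $(X'Z_i)^{-1}X'y_i$, which gives a simple check on the code.

```python
import numpy as np

def fiml_step(alpha, Y, X, spec):
    """One pass of the fixed-point map (7.2.12).

    spec[i] = (endog_cols, exog_cols): right-hand-side columns of Y and X in
    equation i. alpha stacks (gamma_i, beta_i) equation by equation.
    """
    T, N = Y.shape
    K = X.shape[1]
    Gamma, B = np.eye(N), np.zeros((K, N))
    pos = 0
    for i, (endog, exog) in enumerate(spec):
        Gamma[endog, i] = -alpha[pos:pos + len(endog)]
        pos += len(endog)
        B[exog, i] = alpha[pos:pos + len(exog)]
        pos += len(exog)
    E = Y @ Gamma - X @ B
    Sigma_inv = np.linalg.inv(E.T @ E / T)            # Sigma from (7.2.2)
    Y_bar = X @ B @ np.linalg.inv(Gamma)              # nonstochastic part of Y
    Z = [np.hstack([Y[:, en], X[:, ex]]) for en, ex in spec]
    Z_bar = [np.hstack([Y_bar[:, en], X[:, ex]]) for en, ex in spec]
    # Blocks of Z_bar'(Sigma^{-1} (x) I)Z and Z_bar'(Sigma^{-1} (x) I)y.
    lhs = np.block([[Sigma_inv[i, j] * Z_bar[i].T @ Z[j] for j in range(N)]
                    for i in range(N)])
    rhs = np.concatenate([
        sum(Sigma_inv[i, j] * Z_bar[i].T @ Y[:, j] for j in range(N))
        for i in range(N)
    ])
    return np.linalg.solve(lhs, rhs)

# Simulated two-equation system (all values made up for illustration).
rng = np.random.default_rng(1)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
Gamma0 = np.array([[1.0, -0.5], [0.4, 1.0]])          # gamma_1 = -0.4, gamma_2 = 0.5
B0 = np.array([[1.0, 2.0], [1.5, 0.0], [0.0, -1.0]])
U = rng.normal(size=(T, 2))
Y = (X @ B0 + U) @ np.linalg.inv(Gamma0)

spec = [([1], [0, 1]), ([0], [0, 2])]                 # each equation excludes one x
alpha = np.array([0.1, 0.5, 0.5, 0.1, 0.5, 0.5])      # arbitrary starting value
for _ in range(3):
    alpha = fiml_step(alpha, Y, X, spec)
```

In overidentified systems the map no longer settles after one pass, and nothing in this sketch guarantees convergence; a practical routine would add a stopping rule and a safeguard against divergence.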

The asymptotic normality of &, follows easily from (7.2.12). Let Z and X be 
Z and X evaluated at â, respectively. Then we have from (7.2.12) 


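\[ \sqrt{T}(\hat{\alpha} - \alpha) = [T^{-1}\hat{Z}'(\hat{\Sigma}^{-1} \otimes I)Z]^{-1}\, T^{-1/2}\hat{Z}'(\hat{\Sigma}^{-1} \otimes I)u. \tag{7.2.13} \]

Because $\hat{\alpha}$ and $\hat{\Sigma}$ are consistent estimators, we can prove, by a straightforward
application of Theorems 3.2.7, 3.5.4, and 3.5.5, that under Assumptions
7.1.1, 7.1.2, and 7.1.3

\[ \sqrt{T}(\hat{\alpha} - \alpha) \to N(0, [\lim T^{-1}\bar{Z}'(\Sigma^{-1} \otimes I)\bar{Z}]^{-1}). \tag{7.2.14} \]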


7.3 Limited Information Model 
7.3.1 Introduction 


In this section we shall consider situations in which a researcher wishes to 
estimate only the parameters of one structural equation. Although these
parameters can, of course, be estimated simultaneously with the parameters of
the remaining equations by FIML, we shall consider simpler estimators that
do not require the estimation of the parameters of the other structural
equations.

We shall assume for simplicity that a researcher wishes to estimate the 
parameters of the first structural equation 


\[ y_1 = Y_1\gamma + X_1\beta + u_1 \quad (\equiv Z_1\alpha + u_1), \tag{7.3.1} \]

where we have omitted the subscript 1 from $\gamma$, $\beta$, and $\alpha$. We shall not specify
the remaining structural equations; instead, we shall merely specify the
reduced form equations for $Y_1$,

\[ Y_1 = X\Pi_1 + V_1, \tag{7.3.2} \]


which is a subset of (7.1.2). The model defined by (7.3.1) and (7.3.2) can be 
regarded as a simplified simultaneous equations model in which simultaneity 
appears only in the first equation. We call this the limited information model. 
In contrast, we call model (7.1.1) the full information model. 

Assumptions 7.1.1 and 7.1.2 are still maintained. However, Assumption
7.1.3 need not be assumed if we assume (7.3.2). Assumption 7.1.1 implies that
the rows of $(u_1, V_1)$ are i.i.d. with zero mean. Throughout this section we shall
denote the variance-covariance matrix of each row of $(u_1, V_1)$ by $\Sigma$. Partition
$\Sigma$ as

\[ \Sigma = \begin{bmatrix} \sigma_1^2 & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}, \]

where $\sigma_1^2$ is the variance of each element of $u_1$. We assume that $\alpha$ is
identifiable. It means that, if we partition $\Pi_1$ as $\Pi_1 = (\Pi_{11}', \Pi_{10}')'$ in such a way that $X_1$
is postmultiplied by $\Pi_{11}$, then the rank of $\Pi_{10}$ is equal to $N_1$, the number of
elements of $\gamma$.


7.3.2 Limited Information Maximum Likelihood Estimator 


The LIML estimator is obtained by maximizing the joint density of $y_1$ and $Y_1$
under the normality assumption with respect to $\alpha$, $\Pi_1$, and $\Sigma$ without any
constraint. Anderson and Rubin (1949) proposed this estimator (without the
particular normalization on $\Gamma$ we have adopted here) and obtained an explicit
formula for the LIML estimator:⁴

\[ \hat{\alpha}_L = [Z_1'(I - \lambda M)Z_1]^{-1} Z_1'(I - \lambda M)y_1, \tag{7.3.3} \]

where $\lambda$ is the smallest characteristic root of $W_1W^{-1}$, $W = [y_1, Y_1]'M[y_1, Y_1]$,
$W_1 = [y_1, Y_1]'M_1[y_1, Y_1]$, $M = I - X(X'X)^{-1}X'$, and $M_1 = I - X_1(X_1'X_1)^{-1}X_1'$.


7.3.3 Two-Stage Least Squares Estimator 


The two-stage least squares (2SLS) estimator of a, proposed by Theil (1953),5 
is defined by 


\[ \hat{\alpha}_{2S} = (Z_1'PZ_1)^{-1} Z_1'Py_1, \tag{7.3.4} \]

where $P = X(X'X)^{-1}X'$.
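Because $I - M = P$, formulas (7.3.3) and (7.3.4) differ only in the scalar that multiplies $M$. The following is a minimal NumPy sketch of that observation, not a production routine. The function names `annihilator`, `k_class`, and `liml_root`, and the simulated data, are illustrative assumptions and do not appear in the text.

```python
import numpy as np

def annihilator(X):
    """Return M = I - X(X'X)^{-1}X'."""
    return np.eye(X.shape[0]) - X @ np.linalg.solve(X.T @ X, X.T)

def k_class(y1, Y1, X1, X, k):
    """[Z1'(I - kM)Z1]^{-1} Z1'(I - kM)y1: k = 1 is (7.3.4), k = lambda is (7.3.3)."""
    Z1 = np.column_stack([Y1, X1])
    A = np.eye(len(y1)) - k * annihilator(X)   # equals P when k = 1
    return np.linalg.solve(Z1.T @ A @ Z1, Z1.T @ A @ y1)

def liml_root(y1, Y1, X1, X):
    """Smallest characteristic root of W1 W^{-1}."""
    YY = np.column_stack([y1, Y1])
    W = YY.T @ annihilator(X) @ YY
    W1 = YY.T @ annihilator(X1) @ YY
    # W^{-1} W1 and W1 W^{-1} are similar matrices, so they share their roots.
    return float(np.linalg.eigvals(np.linalg.solve(W, W1)).real.min())

# Hypothetical simulated data: one endogenous regressor, two excluded exogenous variables.
rng = np.random.default_rng(0)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=(T, 3))])
X1 = X[:, :2]
v = rng.normal(size=T)
u = 0.8 * v + rng.normal(size=T)      # u is correlated with v, so Y1 is endogenous
Y1 = X @ np.array([1.0, 0.5, 1.0, -1.0]) + v
y1 = 0.7 * Y1 + X1 @ np.array([1.0, 2.0]) + u

lam = liml_root(y1, Y1, X1, X)
alpha_2sls = k_class(y1, Y1, X1, X, 1.0)
alpha_liml = k_class(y1, Y1, X1, X, lam)
```

In a just-identified equation the root equals 1, so the two estimators coincide; with overidentification the root is at least 1 and the estimates generally differ in finite samples.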


7.3.4 Asymptotic Distribution of the Limited Information Maximum 
Likelihood Estimator and the Two-Stage Least Squares Estimator 


The LIML and 2SLS estimators of a have the same asymptotic distribution. 
In this subsection we shall derive it without assuming the normality of the 
observations. 

We shall derive the asymptotic distribution of 2SLS. From (7.3.4) we have

\[ \sqrt{T}(\hat{\alpha}_{2S} - \alpha) = (T^{-1}Z_1'PZ_1)^{-1}\, T^{-1/2}Z_1'Pu_1. \tag{7.3.5} \]

The limit distribution of $\sqrt{T}(\hat{\alpha}_{2S} - \alpha)$ is derived by showing that plim
$T^{-1}Z_1'PZ_1$ exists and is nonsingular and that the limit distribution of
$T^{-1/2}Z_1'Pu_1$ is normal.

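First, consider the probability limit of $T^{-1}Z_1'PZ_1$. Substitute
$(X\Pi_1 + V_1, X_1)$ for $Z_1$ in $T^{-1}Z_1'PZ_1$. Then any term involving $V_1$ converges
to 0 in probability. For example,

\[ \begin{aligned} \operatorname{plim} T^{-1}X_1'X(X'X)^{-1}X'V_1 &= \operatorname{plim} T^{-1}X_1'X(T^{-1}X'X)^{-1}T^{-1}X'V_1 \\ &= \operatorname{plim} T^{-1}X_1'X\,(\operatorname{plim} T^{-1}X'X)^{-1} \operatorname{plim} T^{-1}X'V_1 = 0. \end{aligned} \]

The second equality follows from Theorem 3.2.6, and the third equality
follows from plim $T^{-1}X'V_1 = 0$, which can be proved using Theorem 3.2.1.
Therefore

\[ \operatorname{plim} T^{-1}Z_1'PZ_1 = \begin{bmatrix} \Pi_{11}' & \Pi_{10}' \\ I & 0 \end{bmatrix} (\operatorname{plim} T^{-1}X'X) \begin{bmatrix} \Pi_{11} & I \\ \Pi_{10} & 0 \end{bmatrix} \equiv A. \tag{7.3.6} \]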
Furthermore, A is nonsingular because rank$(\Pi_{10}) = N_1$, which is assumed for
identifiability.

Next, consider the limit distribution of $T^{-1/2}Z_1'Pu_1$. From Theorem 3.2.7
we have

\[ \frac{1}{\sqrt{T}} Z_1'Pu_1 \overset{\mathrm{LD}}{=} \frac{1}{\sqrt{T}} \begin{bmatrix} \Pi_1'X' \\ X_1' \end{bmatrix} u_1, \tag{7.3.7} \]

where $\overset{\mathrm{LD}}{=}$ means that both sides of it have the same limit distribution. But, using
Theorem 3.5.4, we can show

\[ \frac{1}{\sqrt{T}} \begin{bmatrix} \Pi_1'X' \\ X_1' \end{bmatrix} u_1 \to N(0, \sigma_1^2 A). \tag{7.3.8} \]

Thus, from Eqs. (7.3.5) through (7.3.8) and by using Theorem 3.2.7 again,
we conclude

\[ \sqrt{T}(\hat{\alpha}_{2S} - \alpha) \to N(0, \sigma_1^2 A^{-1}). \tag{7.3.9} \]


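To prove that the LIML estimator of $\alpha$ has the same asymptotic distribution
as 2SLS, we shall first prove

\[ \operatorname{plim} \sqrt{T}(\lambda - 1) = 0. \tag{7.3.10} \]

We note

\[ \lambda = \min_{\delta} \frac{\delta'W_1\delta}{\delta'W\delta}, \tag{7.3.11} \]

which follows from the identity

\[ \frac{\delta'W_1\delta}{\delta'W\delta} = \frac{\eta'W^{-1/2}W_1W^{-1/2}\eta}{\eta'\eta}, \]

where $\eta = W^{1/2}\delta$, and from Theorems 5 and 10 of Appendix 1. Because

\[ 1 \leq \min_{\delta} \frac{\delta'W_1\delta}{\delta'W\delta} \leq \frac{(1, -\gamma')W_1(1, -\gamma')'}{(1, -\gamma')W(1, -\gamma')'}, \]

we have

\[ \lambda - 1 \leq \frac{u_1'M_1u_1}{u_1'Mu_1} - 1 = \frac{u_1'[X(X'X)^{-1}X' - X_1(X_1'X_1)^{-1}X_1']u_1}{u_1'Mu_1}. \tag{7.3.12} \]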
Therefore the desired result (7.3.10) follows from noting, for example,
plim $T^{-1/2}u_1'X(X'X)^{-1}X'u_1 = 0$. From (7.3.3) we have

\[ \begin{aligned} \sqrt{T}(\hat{\alpha}_L - \alpha) = {} & [T^{-1}Z_1'PZ_1 - (\lambda - 1)T^{-1}Z_1'MZ_1]^{-1} \\ & \times [T^{-1/2}Z_1'Pu_1 - (\lambda - 1)T^{-1/2}Z_1'Mu_1]. \end{aligned} \tag{7.3.13} \]

But (7.3.10) implies that both $(\lambda - 1)T^{-1}Z_1'MZ_1$ and $(\lambda - 1)T^{-1/2}Z_1'Mu_1$
converge to 0 in probability. Therefore, by Theorem 3.2.7,

\[ \sqrt{T}(\hat{\alpha}_L - \alpha) \overset{\mathrm{LD}}{=} \sqrt{T}(Z_1'PZ_1)^{-1}Z_1'Pu_1, \tag{7.3.14} \]


which implies that the LIML estimator of a has the same asymptotic distribu- 
tion as 2SLS. 


7.3.5 Exact Distributions of the Limited Information Maximum Likelihood 
Estimator and the Two-Stage Least Squares Estimator 


The exact finite sample distributions of the two estimators differ. We shall 
discuss briefly the main results about these distributions as well as their ap- 
proximations. The discussion is very brief because there are several excellent 
survey articles on this topic by Mariano (1982), Anderson (1982), Taylor 
(1982), and Phillips (1983). 

In the early years (say, until the 1960s) most of the results were obtained by 
Monte Carlo studies; a summary of these results has been given by Johnston 
(1972). The conclusions of these Monte Carlo studies concerning the choice 
between LIML and 2SLS were inconclusive, although they gave a slight edge 
to 2SLS in terms of the mean squared error or similar moment criteria. 
However, most of these studies did not realize that the moment criteria may 
not be appropriate in view of the later established fact that LIML does not 
have a moment of any order and 2SLS has moments of the order up to and 
including the degree of overidentifiability (Kj, — N; in our notation). 

The exact distributions of the two estimators and their approximations 
have since been obtained, mostly for a simple two-equation model. Anderson 
(1982) summarized these results and showed that 2SLS exhibits a greater 
median-bias than LIML, especially when the degree of simultaneity and the 
degree of overidentifiability are large, and that the convergence to normality 
of 2SLS is slower than that of LIML. 

Another recent result favors LIML over 2SLS; Fuller (1977) proposed the 
modified LIML that is obtained by substituting A — c/(T — K;), where cis any 
constant, for $\lambda$ in (7.3.3). He showed that it has finite moments and dominates
2SLS in terms of the mean squared error to $O(T^{-2})$. Fuller’s estimator can be
interpreted as another example of the second-order efficiency of the
bias-corrected MLE (see Section 4.2.4).

However, we must treat these results with caution. First, both Anderson’s
and Fuller’s results were obtained under the assumption of normality, 
whereas the asymptotic distribution can be obtained without normality. Sec- 
ond, Anderson’s results were obtained only for a simple model; and, more- 
over, it is difficult to verify in practice exactly how large the degrees of simulta- 
neity and overidentifiability should be for LIML to dominate 2SLS. Third, we 
should also compare the performance of estimators under misspecified 
models. Monte Carlo studies indicate that the simpler the estimator the more 
robust it tends to be against misspecification. (See Taylor, 1983, for a critical 
appraisal of the finite sample results.) 


7.3.6 Interpretations of the Two-Stage Least Squares Estimator


There are many interpretations of 2SLS. They are useful when we attempt to 
generalize 2SLS to situations in which some of the standard assumptions of 
the model are violated. Different interpretations often lead to different gener- 
alizations. Here we shall give four that we consider most useful. 

Theil’s interpretation is most famous, and from it the name 2SLS is derived.
In the first stage, the least squares method is applied to Eq. (7.3.2) and the least
squares predictor $\hat{Y}_1 = PY_1$ is obtained. In the second stage, $\hat{Y}_1$ is substituted
for $Y_1$ in the right-hand side of (7.3.1) and $y_1$ is regressed on $\hat{Y}_1$ and $X_1$. The
least squares estimates of $\gamma$ and $\beta$ thus obtained are 2SLS.
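Theil's two-stage description and the closed form (7.3.4) can be checked against each other numerically. The sketch below does this on simulated data; the helper `ols` and all data-generating values are assumptions made for illustration only.

```python
import numpy as np

def ols(y, X):
    """Least squares coefficients (X'X)^{-1}X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical simulated limited information model.
rng = np.random.default_rng(0)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=(T, 3))])
X1 = X[:, :2]
v = rng.normal(size=T)
u = 0.8 * v + rng.normal(size=T)
Y1 = X @ np.array([1.0, 0.5, 1.0, -1.0]) + v
y1 = 0.7 * Y1 + X1 @ np.array([1.0, 2.0]) + u

# First stage: least squares predictor of Y1 from the reduced form (7.3.2).
Y1_hat = X @ ols(Y1, X)
# Second stage: regress y1 on the predictor and X1.
alpha_two_stage = ols(y1, np.column_stack([Y1_hat, X1]))

# Closed form (7.3.4) for comparison.
Z1 = np.column_stack([Y1, X1])
P = X @ np.linalg.solve(X.T @ X, X.T)
alpha_formula = np.linalg.solve(Z1.T @ P @ Z1, Z1.T @ P @ y1)
```

The two vectors agree because $PZ_1 = (PY_1, X_1)$ and $P$ is idempotent, so the second-stage normal equations are exactly those of (7.3.4).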

The two-stage least squares estimator can be interpreted as the asymptoti- 
cally best instrumental variables estimator. This is the way Basmann (1957) 
motivated his estimator, which is equivalent to 2SLS. The instrumental
variables estimator applied to (7.3.1) is commonly defined as $(S'Z_1)^{-1}S'y_1$ for
some matrix S of the same size as $Z_1$. But, here, we shall define it more
generally as

\[ \hat{\alpha}_I = (Z_1'P_SZ_1)^{-1}Z_1'P_Sy_1, \tag{7.3.15} \]

where $P_S = S(S'S)^{-1}S'$. The matrix S should have T rows, but the number of
columns need not be the same as that of $Z_1$. In addition, we assume that S
satisfies the following three conditions:
(i) plim $T^{-1}S'S$ exists and is nonsingular,

(ii) plim $T^{-1}S'u_1 = 0$, and

(iii) plim $T^{-1}S'V_1 = 0$.
Under these assumptions we obtain, in a manner analogous to the derivation
of (7.3.9),

\[ \sqrt{T}(\hat{\alpha}_I - \alpha) \to N(0, \sigma_1^2 C^{-1}), \tag{7.3.16} \]

where

\[ C = \begin{bmatrix} \Pi_{11}' & \Pi_{10}' \\ I & 0 \end{bmatrix} \operatorname{plim} T^{-1}X'P_SX \begin{bmatrix} \Pi_{11} & I \\ \Pi_{10} & 0 \end{bmatrix}. \tag{7.3.17} \]

Because $A \geq C$, we conclude by Theorem 17 of Appendix 1 that 2SLS is the
asymptotically best in the class of the instrumental variables estimators
defined above.

For a third interpretation, we note that the projection matrix P
used to define 2SLS in (7.3.4) purports to eliminate the stochastic
part of $Y_1$. If $V_1$ were observable, the elimination of the stochastic part of $Y_1$
could be more directly accomplished by another projection matrix
$M_{V_1} = I - V_1(V_1'V_1)^{-1}V_1'$. The asymptotic variance-covariance matrix of the
resulting estimator is $(\sigma_1^2 - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})A^{-1}$ and hence smaller than that of
2SLS. When we predict $V_1$ by $MY_1$, where $M = I - X(X'X)^{-1}X'$, and use the
resulting projection matrix in place of $M_{V_1}$, we obtain 2SLS.

The last interpretation we shall give originates in the article by Anderson
and Sawa (1973). Let the reduced form for $y_1$ be

\[ y_1 = X\pi_1 + v_1. \tag{7.3.18} \]

Then (7.3.1), (7.3.2), and (7.3.18) imply

\[ \pi_1 = \Pi_1\gamma + J_1\beta, \tag{7.3.19} \]

where $J_1 = (X'X)^{-1}X'X_1$. From (7.3.19) we obtain

\[ \hat{\pi}_1 = \hat{\Pi}_1\gamma + J_1\beta + (\hat{\pi}_1 - \pi_1) - (\hat{\Pi}_1 - \Pi_1)\gamma, \tag{7.3.20} \]

where $\hat{\pi}_1$ and $\hat{\Pi}_1$ are the least squares estimators of $\pi_1$ and $\Pi_1$, respectively.
Then 2SLS can be interpreted as generalized least squares applied to (7.3.20).
To see this, merely note that (7.3.20) is obtained by premultiplying (7.3.1) by
$(X'X)^{-1}X'$.


7.3.7 Generalized Two-Stage Least Squares Estimator 
Consider regression equations

\[ y = Z\alpha + u \tag{7.3.21} \]

and

\[ Z = X\Pi + V, \tag{7.3.22} \]

where $Eu = 0$, $EV = 0$, $Euu' = \Psi$, and u and V are possibly correlated. We
assume that $\Psi$ is a known nonsingular matrix. Although the limited
information model can be written in this way by assuming that certain elements of V
are identically equal to 0, the reader may regard these two equations in a more
abstract sense and not necessarily as arising from the limited information
model considered in Section 7.3.1.

Premultiplying (7.3.21) and (7.3.22) by $\Psi^{-1/2}$, we obtain

\[ \Psi^{-1/2}y = \Psi^{-1/2}Z\alpha + \Psi^{-1/2}u \tag{7.3.23} \]

and

\[ \Psi^{-1/2}Z = \Psi^{-1/2}X\Pi + \Psi^{-1/2}V. \tag{7.3.24} \]


We define the G2SLS estimator of $\alpha$ as the 2SLS estimator of $\alpha$ applied to
(7.3.23) and (7.3.24); that is,

\[ \hat{\alpha}_{G2S} = [Z'\Psi^{-1}X(X'\Psi^{-1}X)^{-1}X'\Psi^{-1}Z]^{-1} Z'\Psi^{-1}X(X'\Psi^{-1}X)^{-1}X'\Psi^{-1}y. \tag{7.3.25} \]

Given appropriate assumptions on $\Psi^{-1/2}X$, $\Psi^{-1/2}u$, and $\Psi^{-1/2}V$, we can
show

\[ \sqrt{T}(\hat{\alpha}_{G2S} - \alpha) \to N[0, (\lim T^{-1}\Pi'X'\Psi^{-1}X\Pi)^{-1}]. \tag{7.3.26} \]

As in Section 7.3.6, we can show that G2SLS is asymptotically the best
instrumental variables estimator in the model defined by (7.3.21) and (7.3.22). The
limit distribution is unchanged if a regular consistent estimator of $\Psi$ is
substituted.

The idea of G2SLS is attributable to Theil (1961), who defined it in another
asymptotically equivalent way: $(Z'P\Psi^{-1}PZ)^{-1}Z'P\Psi^{-1}Py$. It has the same
asymptotic distribution as $\hat{\alpha}_{G2S}$.
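Formula (7.3.25) can be evaluated without forming $\Psi^{-1/2}$ explicitly, since only $\Psi^{-1}$ enters. The following sketch assumes a known AR(1)-type $\Psi$ and simulated data; the function name `g2sls` and every numerical value are hypothetical choices for illustration.

```python
import numpy as np

def g2sls(y, Z, X, Psi):
    """Evaluate (7.3.25) for a known symmetric nonsingular Psi."""
    PsiInvX = np.linalg.solve(Psi, X)                            # Psi^{-1} X
    # B = Z' Psi^{-1} X (X' Psi^{-1} X)^{-1} X' Psi^{-1}
    B = Z.T @ PsiInvX @ np.linalg.solve(X.T @ PsiInvX, PsiInvX.T)
    return np.linalg.solve(B @ Z, B @ y)

# Hypothetical data with Euu' = Psi and u correlated with V.
rng = np.random.default_rng(3)
T = 60
idx = np.arange(T)
Psi = 0.5 ** np.abs(idx[:, None] - idx[None, :])
L = np.linalg.cholesky(Psi)                                      # Psi = LL'
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
e_v = rng.normal(size=T)
u = L @ (0.6 * e_v + 0.8 * rng.normal(size=T))
Z = np.column_stack([X @ np.array([1.0, 1.0, -1.0]) + e_v, np.ones(T)])
y = Z @ np.array([0.7, 1.0]) + u

alpha_g2sls = g2sls(y, Z, X, Psi)
```

Setting `Psi` to the identity reproduces ordinary 2SLS, and applying 2SLS to data premultiplied by the inverse of any square root of $\Psi$ (here the Cholesky factor) reproduces `alpha_g2sls`, in line with the definition through (7.3.23) and (7.3.24).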


7.4 Three-Stage Least Squares Estimator 


In this section we shall again consider the full information model defined by
(7.1.1). The 3SLS estimator of $\alpha$ in (7.1.5) can be defined as a special case of
G2SLS applied to the same equation. The reduced form equation comparable
to (7.3.22) is provided by

\[ Z = \underline{X}\Pi + V, \tag{7.4.1} \]

where $\underline{X} = I \otimes X$, $\Pi = \mathrm{diag}[(\Pi_1, J_1), (\Pi_2, J_2), \ldots, (\Pi_N, J_N)]$,
$V = \mathrm{diag}[(V_1, 0), (V_2, 0), \ldots, (V_N, 0)]$, and $J_i = (X'X)^{-1}X'X_i$.

To define 3SLS (proposed by Zellner and Theil, 1962), we need a consistent
estimator of $\Sigma$, which can be obtained as follows:

Step 1. Obtain the 2SLS estimator of $\alpha_i$, $i = 1, 2, \ldots, N$.

Step 2. Calculate $\hat{u}_i = y_i - Z_i\hat{\alpha}_{i,2S}$, $i = 1, 2, \ldots, N$.

Step 3. Estimate $\sigma_{ij}$ by $\hat{\sigma}_{ij} = T^{-1}\hat{u}_i'\hat{u}_j$.

Next, inserting $X = \underline{X}$ and $\Psi = \hat{\Sigma} \otimes I$ into (7.3.25), with Z and y as in (7.1.5), we have after
some manipulation of Kronecker products (see Theorem 22 of Appendix 1)

\[ \begin{aligned} \hat{\alpha}_{3S} &= [Z'(\hat{\Sigma}^{-1} \otimes P)Z]^{-1} Z'(\hat{\Sigma}^{-1} \otimes P)y \\ &= [\hat{Z}'(\hat{\Sigma}^{-1} \otimes I)Z]^{-1} \hat{Z}'(\hat{\Sigma}^{-1} \otimes I)y, \end{aligned} \tag{7.4.2} \]

where $\hat{Z} = \mathrm{diag}(\hat{Z}_1, \hat{Z}_2, \ldots, \hat{Z}_N)$ and $\hat{Z}_i = PZ_i$. The second formula of
(7.4.2) is similar to the instrumental variables representation of FIML given in
(7.2.12). We can make use of this similarity and prove that 3SLS has the same
asymptotic distribution as FIML. For exact distributions of 3SLS and FIML,
see the survey articles mentioned in Section 7.3.5.
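Steps 1 through 3 and the first formula of (7.4.2) translate fairly directly into code. The sketch below is one possible arrangement for a hypothetical two-equation system; `block_diag`, `three_sls`, and the simulated structural parameters are assumptions for illustration, and the explicit Kronecker product is only sensible for small T.

```python
import numpy as np

def block_diag(mats):
    """Stack matrices along the diagonal, as in Z = diag(Z_1, ..., Z_N)."""
    out = np.zeros((sum(m.shape[0] for m in mats), sum(m.shape[1] for m in mats)))
    r = c = 0
    for m in mats:
        out[r:r + m.shape[0], c:c + m.shape[1]] = m
        r, c = r + m.shape[0], c + m.shape[1]
    return out

def three_sls(ys, Zs, X):
    """3SLS via the first formula of (7.4.2); returns (alpha_hat, Sigma_hat)."""
    T = X.shape[0]
    P = X @ np.linalg.solve(X.T @ X, X.T)
    # Steps 1 and 2: equation-by-equation 2SLS residuals.
    resid = [y - Z @ np.linalg.solve(Z.T @ P @ Z, Z.T @ P @ y) for y, Z in zip(ys, Zs)]
    U_hat = np.column_stack(resid)
    Sigma_hat = U_hat.T @ U_hat / T                 # Step 3
    Z, y = block_diag(Zs), np.concatenate(ys)
    Omega = np.kron(np.linalg.inv(Sigma_hat), P)    # Sigma_hat^{-1} (x) P
    return np.linalg.solve(Z.T @ Omega @ Z, Z.T @ Omega @ y), Sigma_hat

# Hypothetical two-equation system solved through its reduced form.
rng = np.random.default_rng(2)
T = 100
X = np.column_stack([np.ones(T), rng.normal(size=(T, 3))])
U = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=T)
g1, g2 = 0.5, -0.4
a1 = X[:, :2] @ np.array([1.0, 1.0]) + U[:, 0]               # exogenous part plus u1
a2 = X[:, [0, 2, 3]] @ np.array([0.5, 1.0, -1.0]) + U[:, 1]  # exogenous part plus u2
y1 = (a1 + g1 * a2) / (1.0 - g1 * g2)
y2 = (a2 + g2 * a1) / (1.0 - g1 * g2)
Zs = [np.column_stack([y2, X[:, :2]]), np.column_stack([y1, X[:, [0, 2, 3]]])]

alpha_3sls, Sigma_hat = three_sls([y1, y2], Zs, X)
```

The second formula of (7.4.2), built from $\hat{Z}_i = PZ_i$, gives the same vector and avoids projecting twice.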


7.5 Further Topics 


The following topics have not been discussed in this chapter but many impor- 
tant results in these areas have appeared in the literature over the past several 
years: (1) lagged endogenous variables included in X; (2) serially correlated 
errors; (3) prediction; and (4) undersized samples (that is, X having more 
columns than rows). Harvey (1981a) has given a detailed discussion of the first 
two topics. Recent results concerning the third topic can be found in articles 
by Sant (1978) and by Nagar and Sahay (1978). For a discussion of the fourth 
topic, consult the articles by Brundy and Jorgenson (1974) and by Swamy 
(1980). 


Exercises 


1. (Section 7.3.1) 

In the limited information model defined by (7.3.1) and (7.3.2), let
$X = (X_1, X_2)$ where $X_1$ and $X_2$ have $K_1$ and $K_2$ columns, respectively.
Suppose we define a class of instrumental variables estimators of $\alpha$ by
$(S'Z_1)^{-1}S'y_1$ where $S = (X_2A, X_1)$ with A being a $K_2 \times N_1$ ($K_2 \geq N_1$)
matrix of constants. Show that there exists some A for which the
instrumental variables estimator is consistent if and only if the rank condition of
identifiability is satisfied for Eq. (7.3.1).


2. (Section 7.3.3) 
Show that the LIML estimator of $\gamma$ is obtained by minimizing
$\delta'W_1\delta/\delta'W\delta$, where $\delta = (1, -\gamma')'$, with respect to $\gamma$. Hence the estimator
is sometimes referred to as the least variance ratio estimator. Also, show
that the minimization of $\delta'W_1\delta - \delta'W\delta$ yields the 2SLS estimator of $\gamma$.


3. (Section 7.3.3)
Consider the model 
y-YytX)ftu-Zatu 
Y=XII+V=Y+V, 
where 
JEU, VY (u, V) -X- B ža], 
Obtain the asymptotic variance-covariance matrix of à = (Y’Y)"'Y’y 
and compare it with that of the 2SLS estimator of a. 
4. (Section 7.3.3)
In a two-equation model 
yy7 yy tu, 
Yo = yi + fixa + Ax, + us, 
compute the LIML and 2SLS estimates of y, , given the following moment 


matrices 
~w_| 1 0 e 1i2 w|i l 
xx=[ j a vy-[3 4! xv-|1 i], 


where X = (x,, x2) and Y = (y1, yj). 


5. (Section 7.3.7)
Show that Theil's G2SLS defined at the end of Section 7.3.7 is
asymptotically equivalent to the definition (7.3.25).


6. (Section 7.3.7)

Define $\hat{\alpha} = [Z'X(X'\Psi X)^{-1}X'Z]^{-1}Z'X(X'\Psi X)^{-1}X'y$. Show that this is a
consistent estimator of $\alpha$ in model (7.3.21) but not as asymptotically
efficient as $\hat{\alpha}_{G2S}$ defined in (7.3.25).


7. (Section 7.4)

Suppose that a simultaneous equations model is defined by (7.1.5) and the
reduced form $Z = [I \otimes X]\Pi + V$. Show that $\hat{\alpha}_{G2S}$ defined in (7.3.25) and
Theil's G2SLS defined at the end of Section 7.3.7 will lead to the same
3SLS when applied to this model.

8. (Section 7.4)
Consider the following model:
(1) $y_1 = \gamma y_2 + u_1$
(2) $y_2 = \beta_1x_1 + \beta_2x_2 + u_2 \equiv X\beta + u_2$,
where $\gamma$, $\beta_1$, and $\beta_2$ are scalar unknown parameters, $x_1$ and $x_2$ are T-component
vectors of constants such that $T^{-1}X'X$ is nonsingular for every T and
also in the limit, $y_1$ and $y_2$ are T-component vectors of observable random
variables, and $u_1$ and $u_2$ are T-component vectors of unobservable random
variables that are independently and identically distributed with zero mean
and contemporaneous variance-covariance matrix

\[ \Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}. \]

a. Prove that the 3SLS estimator of $\gamma$ is identical with the 2SLS
estimator of $\gamma$.

b. Transform Eqs. (1) and (2) to obtain the following equivalent model:
(3) $y_2 = \frac{1}{\gamma}y_1 - \frac{1}{\gamma}u_1$
(4) $y_1 = \gamma\beta_1x_1 + \gamma\beta_2x_2 + \gamma u_2 + u_1$.
Define the reverse 2SLS estimator of $\gamma$ as the reciprocal of the 2SLS
estimator of $1/\gamma$ obtained from Eqs. (3) and (4). Prove that the reverse 2SLS
estimator of $\gamma$ has the same asymptotic distribution as the 2SLS estimator
of $\gamma$.

c. Assume that at period p outside the sample the following
relationship holds:
(5) $y_{1p} = \gamma y_{2p} + u_{1p}$
(6) $y_{2p} = \beta_1x_{1p} + \beta_2x_{2p} + u_{2p} \equiv x_p'\beta + u_{2p}$,
where $u_{1p}$ and $u_{2p}$ are independent of $u_1$ and $u_2$ with zero mean and
variance-covariance matrix $\Sigma$. We want to predict $y_{1p}$ when $x_{1p}$ and $x_{2p}$ are
given. Compare the mean squared prediction error of the indirect least
squares predictor, defined by
(7) $\hat{y}_{1p} = x_p'(X'X)^{-1}X'y_1$,
with the mean squared prediction error of the 2SLS predictor defined by
(8) $\tilde{y}_{1p} = \tilde{\gamma}x_p'(X'X)^{-1}X'y_2$,
where $\tilde{\gamma}$ is the 2SLS estimator of $\gamma$. Can we say one is uniformly smaller than
the other?

8 Nonlinear Simultaneous Equations Models 


In this chapter we shall develop the theory of statistical inference for nonlinear 
simultaneous equations models. The main results are taken from the author’s 
recent contributions (especially, Amemiya, 1974c, 1975a, 1976a, 1977a). 
Some additional results can be found in Amemiya (1983a). Section 8.1 deals 
with the estimation of the parameters of a single equation and Section 8.2 with
that of simultaneous equations. Section 8.3 deals with tests of hypotheses, 
prediction, and computation. 


8.1 Estimation in a Single Equation 
8.1.1 Nonlinear Two-Stage Least Squares Estimator 
In this section we shall consider the nonlinear regression equation

\[ y_t = f(Y_t, X_{1t}, \alpha_0) + u_t, \quad t = 1, 2, \ldots, T, \tag{8.1.1} \]

where $y_t$ is a scalar endogenous variable, $Y_t$ is a vector of endogenous
variables, $X_{1t}$ is a vector of exogenous variables, $\alpha_0$ is a K-vector of unknown
parameters, and $\{u_t\}$ are scalar i.i.d. random variables with $Eu_t = 0$ and
$Vu_t = \sigma^2$. This model does not specify the distribution of $Y_t$. Equation (8.1.1)
may be one of many structural equations that simultaneously define the
distribution of $y_t$ and $Y_t$, but here we are not concerned with the other
equations. Sometimes we shall write $f(Y_t, X_{1t}, \alpha_0)$ simply as $f_t(\alpha_0)$ or as $f_t$. We
define T-vectors y, f, and u, the tth elements of which are $y_t$, $f_t$, and $u_t$,
respectively, and matrices Y and $X_1$, the tth rows of which are $Y_t'$ and $X_{1t}'$,
respectively.¹

The nonlinear least squares estimator of $\alpha_0$ in this model is generally
inconsistent for the same reason that the least squares estimator is inconsistent in a
linear simultaneous equations model. We can see this by considering (4.3.8)
and noting that plim $A_3 \neq 0$ in general because $f_t$ may be correlated with $u_t$ in
the model (8.1.1) because of the possible dependence of $Y_t$ on $y_t$. In this
section we shall consider how we can generalize the two-stage least squares
(2SLS) method to the nonlinear model (8.1.1) so that we can obtain a
consistent estimator.

Following the article by Amemiya (1974c), we define the class of nonlinear
two-stage least squares (NL2S) estimators of $\alpha_0$ in the model (8.1.1) as the
value of $\alpha$ that minimizes

\[ S_T(\alpha|W) = (y - f)'W(W'W)^{-1}W'(y - f), \tag{8.1.2} \]

where W is some matrix of constants with rank at least equal to K.
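One way to carry out the minimization of (8.1.2) numerically is a damped Gauss-Newton iteration on the projected residuals. This is only a sketch of one possible algorithm, not a method prescribed by the text; the function `nl2s`, the exponential regression function, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def nl2s(f, jac, y, W, alpha_start, max_iter=200, tol=1e-12):
    """Minimize S_T(alpha|W) = (y - f)'P_W(y - f) by damped Gauss-Newton steps."""
    Q, _ = np.linalg.qr(W)                 # P_W = QQ', so S_T = ||Q'(y - f)||^2
    alpha = np.asarray(alpha_start, dtype=float)
    r = Q.T @ (y - f(alpha))
    for _ in range(max_iter):
        G = Q.T @ jac(alpha)               # Q' (df/dalpha')
        # Gauss-Newton step: (G_f'P_W G_f)^{-1} G_f'P_W (y - f)
        step = np.linalg.lstsq(G, r, rcond=None)[0]
        t = 1.0
        r_new = Q.T @ (y - f(alpha + step))
        while r_new @ r_new > r @ r and t > 1e-8:   # halve until S_T does not increase
            t *= 0.5
            r_new = Q.T @ (y - f(alpha + t * step))
        alpha, r = alpha + t * step, r_new
        if np.linalg.norm(t * step) < tol:
            break
    return alpha

# Hypothetical model y_t = a1 * exp(a2 * Y_t) + u_t with Y_t correlated with u_t.
rng = np.random.default_rng(1)
T = 400
x = rng.uniform(0.0, 2.0, size=T)
W = np.column_stack([np.ones(T), x, x**2])         # instruments: powers of x
v = rng.normal(size=T)
Yt = x + 0.3 * v
alpha0 = np.array([1.0, 0.5])
f = lambda a: a[0] * np.exp(a[1] * Yt)
jac = lambda a: np.column_stack([np.exp(a[1] * Yt), a[0] * Yt * np.exp(a[1] * Yt)])
u = 0.2 * (0.8 * v + 0.6 * rng.normal(size=T))
y = f(alpha0) + u

alpha_hat = nl2s(f, jac, y, W, alpha_start=[0.8, 0.4])
```

When f is linear in the parameters, f = Zα, a single full step from any starting value lands on the 2SLS-type estimator that uses W as instruments, which is the sense in which (8.1.2) generalizes Theil's 2SLS.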

In the literature prior to the article by Amemiya (1974c), a generalization of 
2SLS was considered only in special cases of the fully nonlinear model (8.1.1), 
namely, (1) the case of nonlinearity only in parameters and (2) the case of 
nonlinearity only in variables. See, for example, the article by Zellner, Huang, 
and Chau (1965) for the first case and the article by Kelejian (1971) for the 
second case.? The definition in the preceding paragraph contains as special 
cases the Zellner-Huang-Chau definition and the Kelejian definition, as well 
as Theil's 2SLS. By defining the estimator as a solution of a minimization 
problem, it is possible to prove its consistency and asymptotic normality by 
the techniques discussed in Chapter 4. 

First, we shall prove the consistency of NL2S using the general result for an 
extremum estimator given in Theorem 4.1.2. The proof is analogous to but 
slightly different from a proof given by Amemiya (1974c). The proof differs 
from the proof of the consistency of NLLS (Theorem 4.3.1) in that the deriva- 
tive of f is used more extensively here. 


THEOREM 8.1.1. Consider the nonlinear regression model (8.1.1) with the
additional assumptions:
(A) lim $T^{-1}W'W$ exists and is nonsingular.
(B) $\partial f_t/\partial\alpha$ exists and is continuous in $N(\alpha_0)$, an open neighborhood of $\alpha_0$.
(C) $T^{-1}W'(\partial f/\partial\alpha')$ converges in probability uniformly in $\alpha \in N(\alpha_0)$.
(D) plim $T^{-1}W'(\partial f/\partial\alpha')_{\alpha_0}$ is full rank.
Then a solution of the minimization of (8.1.2) is consistent in the sense of
Theorem 4.1.2.

Proof. Inserting (8.1.1) into (8.1.2), we can rewrite $T^{-1}$ times (8.1.2) as

\[ \begin{aligned} T^{-1}S_T &= T^{-1}u'P_Wu + T^{-1}(f_0 - f)'P_W(f_0 - f) + T^{-1}2(f_0 - f)'P_Wu \\ &\equiv A_1 + A_2 + A_3, \end{aligned} \tag{8.1.3} \]

where $P_W = W(W'W)^{-1}W'$, $f = f(\alpha)$, and $f_0 = f(\alpha_0)$. Note that (8.1.3) is
similar to (4.3.8). First, $A_1$ converges to 0 in probability because of assumption A
and the assumptions on $\{u_t\}$. Second, consider $A_2$. Assumption B enables us to
write

\[ f = f_0 + G_*(\alpha - \alpha_0), \tag{8.1.4} \]

where $G_*$ is the matrix the tth row of which is $\partial f_t/\partial\alpha'$ evaluated at $\alpha_t^*$ between
$\alpha$ and $\alpha_0$. Therefore we have

\[ A_2 = T^{-1}(\alpha_0 - \alpha)'G_*'P_WG_*(\alpha_0 - \alpha). \tag{8.1.5} \]

Therefore, because of assumptions C and D, $A_2$ converges in probability
uniformly in $\alpha \in N(\alpha_0)$ to a function that is uniquely minimized at $\alpha_0$.
Finally, consider $A_3$. We have by the Cauchy-Schwarz inequality

\[ \begin{aligned} T^{-1}|(f_0 - f)'P_Wu| &= T^{-1}|(\alpha_0 - \alpha)'G_*'P_Wu| \\ &\leq [T^{-1}(\alpha_0 - \alpha)'G_*'P_WG_*(\alpha_0 - \alpha)]^{1/2} \times [T^{-1}u'P_Wu]^{1/2}. \end{aligned} \tag{8.1.6} \]

Therefore the results obtained above regarding $A_1$ and $A_2$ imply that $A_3$
converges to 0 in probability uniformly in $\alpha \in N(\alpha_0)$. Thus we have verified all
the conditions of Theorem 4.1.2.


Next, we shall prove the asymptotic normality of NL2S by verifying the 
conditions of Theorem 4.1.3. 


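THEOREM 8.1.2. In addition to the assumptions of the nonlinear regression
model (8.1.1) and those of Theorem 8.1.1, make the following assumptions:
(A) $\partial^2 f_t/\partial\alpha\partial\alpha'$ exists and is continuous in $\alpha \in N(\alpha_0)$.
(B) $T^{-1}W'(\partial^2 f/\partial\alpha_i\partial\alpha')$ converges in probability to a finite matrix
uniformly in $\alpha \in N(\alpha_0)$, where $\alpha_i$ is the ith element of the vector $\alpha$.
Then, if we denote a consistent solution of the minimization of (8.1.2) by $\hat{\alpha}$,
we have

\[ \sqrt{T}(\hat{\alpha} - \alpha_0) \to N(0, \sigma^2[\operatorname{plim} T^{-1}G_0'P_WG_0]^{-1}), \]

where $G_0$ is $\partial f/\partial\alpha'$ evaluated at $\alpha_0$.

Proof. First, consider condition C of Theorem 4.1.3. We have

\[ \frac{1}{\sqrt{T}} \left. \frac{\partial S_T}{\partial\alpha} \right|_{\alpha_0} = -\frac{2}{T} G_0'W \cdot \sqrt{T}(W'W)^{-1}W'u. \tag{8.1.7} \]

But, using Theorems 3.5.4 and 3.5.5, $\sqrt{T}(W'W)^{-1}W'u \to N[0, \sigma^2 \lim T(W'W)^{-1}]$
because of assumption A of Theorem 8.1.1 and the assumptions
on $\{u_t\}$. Therefore, because of assumption D of Theorem 8.1.1 and
because of Theorem 3.2.7, we have

\[ \frac{1}{\sqrt{T}} \left. \frac{\partial S_T}{\partial\alpha} \right|_{\alpha_0} \to N\!\left(0, 4\sigma^2 \operatorname{plim} \frac{1}{T} G_0'P_WG_0\right). \tag{8.1.8} \]

Second, consider assumption B of Theorem 4.1.3. We have, for any $\tilde{\alpha}$ such
that plim $\tilde{\alpha} = \alpha_0$,

\[ \frac{1}{T} \left. \frac{\partial^2 S_T}{\partial\alpha\partial\alpha'} \right|_{\tilde{\alpha}} = \frac{2}{T}\tilde{G}'P_W\tilde{G} - 2\left\{ \left[\frac{1}{T}u'W - \frac{1}{T}(\tilde{\alpha} - \alpha_0)'G_+'W\right] \left[\frac{W'W}{T}\right]^{-1} \left[\frac{1}{T}W' \left. \frac{\partial^2 f}{\partial\alpha_i\partial\alpha'} \right|_{\tilde{\alpha}}\right] \right\}, \tag{8.1.9} \]

where { } in the second term of the right-hand side is the matrix the ith row of
which is given inside { }, $\tilde{G}$ is $\partial f/\partial\alpha'$ evaluated at $\tilde{\alpha}$, and $G_+$ is the matrix the
tth row of which is $\partial f_t/\partial\alpha'$ evaluated at $\alpha_t^+$ between $\tilde{\alpha}$ and $\alpha_0$. But the term
inside { } converges to 0 in probability because of assumptions A and C of
Theorem 8.1.1 and assumption B of this theorem. Therefore assumptions A
and C of Theorem 8.1.1 imply

\[ \operatorname{plim} \frac{1}{T} \left. \frac{\partial^2 S_T}{\partial\alpha\partial\alpha'} \right|_{\tilde{\alpha}} = \operatorname{plim} \frac{2}{T} G_0'P_WG_0. \tag{8.1.10} \]

Finally, because assumption A of this theorem implies assumption A of
Theorem 4.1.3, we have verified all the conditions of Theorem 4.1.3. Hence,
the conclusion of the theorem follows from (8.1.8) and (8.1.10).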


Amemiya (1975a) considered, among other things, the optimal choice of
W. It is easy to show that plim $T(G_0'P_WG_0)^{-1}$ is minimized in the matrix sense
(that is, A > B means A − B is a positive definite matrix) when we choose
$W = \bar{G} \equiv EG_0$. We call the resulting estimator the best nonlinear two-stage
least squares (BNL2S) estimator. The asymptotic covariance matrix of $\sqrt{T}$
times the estimator is given by

\[ V_B = \sigma^2 \operatorname{plim} T(\bar{G}'\bar{G})^{-1}. \tag{8.1.11} \]

However, BNL2S is not a practical estimator because (1) it is often difficult to
find an explicit expression for $\bar{G}$, and (2) $\bar{G}$ generally depends on the unknown
parameter vector $\alpha_0$. The second problem is less serious because $\alpha_0$ may be
replaced by any consistent member of the NL2S class using some W.

Given the first problem, the following procedure recommended by
Amemiya (1976a) seems the best practical way to approximate BNL2S:

Step 1. Compute $\hat{\alpha}$, a member of the NL2S class.

Step 2. Evaluate $G \equiv \partial f/\partial\alpha'$ at $\hat{\alpha}$ and call it $\hat{G}$.

Step 3. Treat $\hat{G}$ as the dependent variables of regressions and search for the
optimal set of independent variables, denoted $W_0$, that best predict $\hat{G}$.

Step 4. Set $W = W_0$.

If we wanted to be more elaborate, we could search for a different set of
independent variables for each column of $\hat{G}$ (say, $W_i$ for the ith column $\hat{g}_i$)
and set $W = [P_{W_1}\hat{g}_1, P_{W_2}\hat{g}_2, \ldots, P_{W_K}\hat{g}_K]$.

Kelejian (1974) proposed another way to approximate $\bar{G}$. He proposed this
method for the model that is nonlinear only in variables, but it could also work
for certain fully nonlinear cases. Let the tth row of $G_0$ be $G_{0t}'$, that is, $G_{0t}' =
(\partial f_t/\partial\alpha')_{\alpha_0}$. Then, because $G_{0t}$ is a function of $Y_t$ and $\alpha_0$, it is also a function of
$u_t$ and $\alpha_0$; therefore write $G_t(u_t, \alpha_0)$. Kelejian's suggestion was to generate $u_t$
independently n times by simulation and approximate $EG_{0t}$ by $n^{-1}\sum_{i=1}^{n}
G_t(u_{ti}, \hat{\alpha})$, where $\hat{\alpha}$ is some consistent estimator of $\alpha_0$. Kelejian also pointed
out that $G_t(0, \hat{\alpha})$ is also a possible approximation for $EG_t$; although it is
computationally simpler, it is likely to be a worse approximation than that
given earlier.


8.1.2 Box-Cox Transformation 
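Box and Cox (1964) proposed the model³

\[ z_t(\lambda) = x_t'\beta + u_t, \tag{8.1.12} \]

where, for $y_t > 0$,

\[ z_t(\lambda) = \begin{cases} \dfrac{y_t^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\[1ex] \log y_t & \text{if } \lambda = 0. \end{cases} \tag{8.1.13} \]

Note that because $\lim_{\lambda \to 0}(y_t^{\lambda} - 1)/\lambda = \log y_t$, $z_t(\lambda)$ is continuous at $\lambda = 0$. It is
assumed that $\{u_t\}$ are i.i.d. with $Eu_t = 0$ and $Vu_t = \sigma^2$.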
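The transformation (8.1.13) is short enough to write out directly. The sketch below uses only the Python standard library; the function name `box_cox` is an illustrative choice, and the use of `math.expm1` is an implementation decision made here to keep the computation accurate when $\lambda$ is close to 0.

```python
import math

def box_cox(y, lam):
    """z(lambda) of (8.1.13) for a single observation y > 0."""
    if y <= 0:
        raise ValueError("the Box-Cox transformation requires y > 0")
    if lam == 0.0:
        return math.log(y)
    # (y**lam - 1)/lam written as expm1(lam*log y)/lam to avoid cancellation near lam = 0
    return math.expm1(lam * math.log(y)) / lam

z_level = box_cox(2.0, 1.0)      # lambda = 1: a shifted version of y itself
z_log = box_cox(2.0, 0.0)        # lambda = 0: log y
z_near_log = box_cox(2.0, 1e-9)  # continuity at lambda = 0
```

With $\lambda = 1$ the transform is $y_t - 1$, and values of $\lambda$ near 0 give values near $\log y_t$, which is the continuity property noted above.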

The transformation $z_t(\lambda)$ is attractive because it contains $y_t$ and $\log y_t$ as
special cases, and therefore the choice between $y_t$ and $\log y_t$ can be made
within the framework of classical statistical inference.

Box and Cox proposed estimating $\lambda$, $\beta$, and $\sigma^2$ by the method of maximum
likelihood assuming the normality of $u_t$. However, $u_t$ cannot be normally
distributed unless $\lambda = 0$ because $z_t(\lambda)$ is subject to the following bounds:

\[ z_t(\lambda) \begin{cases} \geq -\dfrac{1}{\lambda} & \text{if } \lambda > 0 \\[1ex] \leq -\dfrac{1}{\lambda} & \text{if } \lambda < 0. \end{cases} \tag{8.1.14} \]

Later we shall discuss their implications on the properties of the Box-Cox
MLE.

This topic is relevant here because we can apply NL2S to this model with
only a slight modification. To accommodate a model such as (8.1.12), we
generalize (8.1.1) as

\[ f(y_t, Y_t, X_{1t}, \alpha_0) = u_t. \tag{8.1.15} \]

Then all the results of the previous section are valid with only an apparent
modification. The minimand (8.1.2), which defines the class of NL2S
estimators, should now be written as

\[ f'W(W'W)^{-1}W'f, \tag{8.1.16} \]

and the conclusions of Theorems 8.1.1 and 8.1.2 remain intact.
Define the NL2S estimator (actually, a class of estimators) of $\alpha \equiv (\lambda, \beta')'$,
denoted $\hat{\alpha}$, in the model (8.1.12) as the value of $\alpha$ that minimizes

\[ [z(\lambda) - X\beta]'W(W'W)^{-1}W'[z(\lambda) - X\beta], \tag{8.1.17} \]

where $z(\lambda)$ is the vector the tth element of which is $z_t(\lambda)$. The discussion about
the choice of W given in the preceding section applies here as well. One
practical choice of W would be to use the x's and their powers and cross
products. Using arguments similar to the ones given in the preceding section,
we can prove the consistency and the asymptotic normality of the estimator.
The asymptotic covariance matrix of $\sqrt{T}(\hat{\alpha} - \alpha)$ is given by

\[ V(\hat{\alpha}) = \sigma^2 \operatorname{plim} T \left\{ \begin{bmatrix} \partial z'/\partial\lambda \\ -X' \end{bmatrix} W(W'W)^{-1}W' \left[ \frac{\partial z}{\partial\lambda}, \; -X \right] \right\}^{-1}. \tag{8.1.18} \]
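For a fixed $\lambda$ the minimand (8.1.17) is quadratic in $\beta$, so one simple way to compute the estimator is to concentrate out $\beta$ and search over $\lambda$ on a grid. The sketch below follows that route with W built from powers of x, as suggested above. The function names, the grid, and the simulated data are assumptions for illustration, and a bounded grid is a deliberate choice here rather than something taken from the text.

```python
import numpy as np

def box_cox_vec(y, lam):
    """Vector z(lambda) of (8.1.13) for positive y."""
    return np.log(y) if lam == 0 else np.expm1(lam * np.log(y)) / lam

def nl2s_box_cox(y, X, W, lam_grid):
    """Grid search over lambda with beta concentrated out of (8.1.17)."""
    Q, _ = np.linalg.qr(W)                      # P_W = QQ'
    QX = Q.T @ X
    best = None
    for lam in lam_grid:
        Qz = Q.T @ box_cox_vec(y, lam)
        # beta(lambda) = (X'P_W X)^{-1} X'P_W z(lambda)
        beta = np.linalg.lstsq(QX, Qz, rcond=None)[0]
        r = Qz - QX @ beta
        value = float(r @ r)                    # minimand (8.1.17) at (lambda, beta(lambda))
        if best is None or value < best[0]:
            best = (value, float(lam), beta)
    return best[1], best[2], best[0]

# Hypothetical data generated from (8.1.12) with lambda = 0.5.
rng = np.random.default_rng(4)
T = 100
x = rng.uniform(1.0, 2.0, size=T)
X = np.column_stack([np.ones(T), x])
W = np.column_stack([np.ones(T), x, x**2, x**3])
lam0, beta0 = 0.5, np.array([1.0, 2.0])
y_clean = (1.0 + lam0 * (X @ beta0)) ** (1.0 / lam0)
y_noisy = (1.0 + lam0 * (X @ beta0 + 0.1 * rng.normal(size=T))) ** (1.0 / lam0)
lam_grid = np.linspace(-1.0, 2.0, 31)

lam_hat, beta_hat, s_min = nl2s_box_cox(y_noisy, X, W, lam_grid)
```

A finer grid, or a one-dimensional optimizer started from the grid minimum, would refine `lam_hat`; the grid version is kept here because it makes the concentration step explicit.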


The Box-Cox maximum likelihood estimator is defined as the pseudo MLE
obtained under the assumption that $\{u_t\}$ are normally distributed. In other
words, the Box-Cox MLE $\hat{\theta}$ of $\theta = (\lambda, \beta', \sigma^2)'$ maximizes

\[ \log L = -\frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{T}[z_t(\lambda) - \beta'x_t]^2 + (\lambda - 1)\sum_{t=1}^{T}\log y_t. \tag{8.1.19} \]

Because $u_t$ cannot in reality be normally distributed, (8.1.19) should be
regarded simply as a maximand that defines an extremum estimator rather than
a proper logarithmic likelihood function. This fact has two implications:

1. We cannot appeal to the theorem about the consistency of the maximum
likelihood estimator; instead, we must evaluate the probability limit as the
value of the parameter at which lim $T^{-1}E\log L$ is maximized.

2. The asymptotic covariance matrix of $\sqrt{T}(\hat{\theta} - \theta)$ is not equal to the usual
expression $-\lim T[E\,\partial^2\log L/\partial\theta\partial\theta']^{-1}$ but is given by

\[ V(\hat{\theta}) = \lim T \left[E\frac{\partial^2\log L}{\partial\theta\partial\theta'}\right]^{-1} E\frac{\partial\log L}{\partial\theta}\frac{\partial\log L}{\partial\theta'} \left[E\frac{\partial^2\log L}{\partial\theta\partial\theta'}\right]^{-1}, \tag{8.1.20} \]

where the derivatives are evaluated at plim $\hat{\theta}$.

Thus, to derive the probability limit and the asymptotic covariance matrix
of $\hat{\theta}$, we must evaluate $E\log L$ and the expectations that appear in (8.1.20) by
using the true distribution of $u_t$.

Various authors have studied the properties of the Box-Cox MLE under
various assumptions about the true distribution of $u_t$. Draper and Cox (1969)
derived an asymptotic expansion for $E\,\partial\log L/\partial\lambda$ in the model without
regressors and showed that it is nearly equal to 0 (indicating the approximate
consistency of the Box-Cox MLE $\hat{\lambda}$) if the skewness of $z_t(\lambda)$ is small. Zarembka
(1974) used a similar analysis and showed $E\,\partial\log L/\partial\lambda \neq 0$ if $\{u_t\}$ are
heteroscedastic. Hinkley (1975) proposed for the model without regressors an
interesting simple estimator of $\lambda$, which does not require the normality
assumption; he compared it with the Box-Cox MLE $\hat{\lambda}$ under the assumption that $y_t$ is
truncated normal (this assumption was also suggested by Poirier, 1978).
Hinkley showed by a Monte Carlo study that the Box-Cox MLE $\hat{\lambda}$ and $\hat{\beta}$
perform generally well under this assumption. Amemiya and Powell (1981)
showed analytically that if $y_t$ is truncated normal, plim $\hat{\lambda} < \lambda$ if $\lambda > 0$ and
plim $\hat{\lambda} > \lambda$ if $\lambda < 0$.

Amemiya and Powell (1981) also considered the case where $y_t$ follows a
two-parameter gamma distribution. This assumption actually alters the
model as it introduces a natural heteroscedasticity in $\{u_t\}$. In this case, the
formula (8.1.18) is no longer valid and the correct formula is given by

\[ V(\hat{\alpha}) = \operatorname{plim} T(Z'P_WZ)^{-1}Z'P_WDP_WZ(Z'P_WZ)^{-1}, \tag{8.1.21} \]

where $Z = (\partial z/\partial\lambda, -X)$ and $D = Euu'$. Amemiya and Powell considered a
simple regression model with one independent variable and showed that
NL2S dominates the Box-Cox MLE for most of the plausible parameter
values considered not only because the inconsistency of the Box-Cox MLE is
significant but also because its asymptotic variance is much larger.

There have been numerous econometric applications of the Box-Cox MLE. 
In the first of such studies, Zarembka (1968) applied the transformation to the 
study of demand for money. Zarembka’s work was followed up by studies by 
White (1972) and Spitzer (1976). For applications in other areas of economics,
see articles by Heckman and Polachek (1974) (earnings-schooling
relationship); by Benus, Kmenta, and Shapiro (1976) (food expenditure analysis); and
by Ehrlich (1977) (crime rate).

All of these studies use the Box-Cox MLE and none uses NL2S. In most of
these studies, the asymptotic covariance matrix for $\hat\beta$ is obtained as if $z_t(\hat\lambda)$ were
the dependent variable of the regression, a practice that can lead to gross
errors. In many of the preceding studies, independent variables as well as the
dependent variable are transformed by the Box-Cox transformation but
possibly with different parameters.


8.1.3 Nonlinear Limited Information Maximum Likelihood Estimator 


In the preceding section we assumed the model (8.1.1) without specifying the
model for $Y_t$ or assuming the normality of $u_t$ and derived the asymptotic
distribution of the class of NL2S estimators and the optimal member of the
class, BNL2S. In this section we shall specify the model for $Y_t$ and shall
assume that all the error terms are normally distributed; under these
assumptions we shall derive the nonlinear limited information maximum likelihood
(NLLI) estimator, which is asymptotically more efficient than BNL2S. The
NLLI estimator takes advantage of the added assumptions, and consequently
its asymptotic properties depend crucially on the validity of the assumptions.
Thus we are aiming at a higher efficiency at the possible sacrifice of robustness.
Assume, in addition to (8.1.1),

$$Y_t' = X_t'\Pi + V_t', \tag{8.1.22}$$

where $V_t$ is a vector of random variables, $X_t$ is a vector of known constants, and
$\Pi$ is a matrix of unknown parameters. We assume that $(u_t, V_t')$ are
independent drawings from a multivariate normal distribution with zero mean and
variance-covariance matrix


$$\Sigma = \begin{bmatrix} \sigma^2 & \sigma_{12}' \\ \sigma_{12} & \Sigma_{22} \end{bmatrix}. \tag{8.1.23}$$

We define $X$ and $V$ as matrices the $t$th rows of which are $X_t'$ and $V_t'$,
respectively. Because $u$ and $V$ are jointly normal, we can write

$$u = V\Sigma_{22}^{-1}\sigma_{12} + \epsilon, \tag{8.1.24}$$

where $\epsilon$ is independent of $V$ and distributed as $N(0, \sigma^{*2}I)$, where
$\sigma^{*2} = \sigma^2 - \sigma_{12}'\Sigma_{22}^{-1}\sigma_{12}$.

The model defined in (8.1.24) may be regarded either as a simplified
nonlinear simultaneous equations model in which both the nonlinearity and the
simultaneity appear only in the first equation or as the model that represents
the "limited information" of the investigator. In the latter interpretation, $X_t$
are not necessarily the original exogenous variables of the system, some of
which appear in the arguments of $f$, but, rather, are the variables a linear
combination of which the investigator believes will explain $Y_t$ effectively.

Because the Jacobian of the transformation from $(u, V)$ to $(y, Y)$ is unity in
our model, the log likelihood function assuming normality can be written,
apart from a constant, as

$$L^{**} = -\frac{T}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\Sigma^{-1}Q, \tag{8.1.25}$$

where

$$Q = \begin{bmatrix} u'u & u'V \\ V'u & V'V \end{bmatrix},$$

$u$ and $V$ representing $y - f$ and $Y - X\Pi$, respectively. Solving $\partial L^{**}/\partial\Sigma = 0$ for
$\Sigma$ yields

$$\Sigma = T^{-1}Q. \tag{8.1.26}$$

Substituting (8.1.26) into (8.1.25), we obtain a concentrated log likelihood
function

$$L^{*} = -\frac{T}{2}\left(\log u'u + \log|V'M_uV|\right). \tag{8.1.27}$$

Solving $\partial L^{*}/\partial\Pi = 0$ for $\Pi$, we obtain

$$\Pi = (X'M_uX)^{-1}X'M_uY, \tag{8.1.28}$$

where $M_u = I - (u'u)^{-1}uu'$. Substituting (8.1.28) into (8.1.27) yields a further
concentrated log likelihood function


$$L = -\frac{T}{2}\left(\log u'u + \log|Y'M_uY - Y'M_uX(X'M_uX)^{-1}X'M_uY|\right), \tag{8.1.29}$$

which depends only on $\alpha$. Interpreting our model as that which represents the
limited information of the researcher, we call the value of $\alpha$ that maximizes
(8.1.29) the NLLI estimator. The asymptotic covariance matrix of $\sqrt{T}$ times
the estimator is given by

$$V_L = \operatorname{plim}\, T\left[\frac{1}{\sigma^{*2}}G'M_VG - \left(\frac{1}{\sigma^{*2}} - \frac{1}{\sigma^2}\right)G'M_XG\right]^{-1}, \tag{8.1.30}$$

where $M_V = I - V(V'V)^{-1}V'$ and $M_X = I - X(X'X)^{-1}X'$.

The maximization of (8.1.29) may be done by the iterative procedures
discussed in Section 4.4. Another iterative method may be defined as follows:
Rewrite (8.1.27) equivalently as

$$L^{*} = -\frac{T}{2}\left(\log u'M_Vu + \log|V'V|\right) \tag{8.1.31}$$

and iterate back and forth between (8.1.28) and (8.1.31). That is, obtain
$\hat\Pi = (X'X)^{-1}X'Y$ and $\hat V = Y - X\hat\Pi$, maximize (8.1.31) with respect to $\alpha$ after
replacing $V$ with $\hat V$, call this estimator $\hat\alpha$ and define $\hat u = y - f(\hat\alpha)$, insert it into
(8.1.28) to obtain another estimator of $\Pi$, and repeat the procedure until
convergence.
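
The equivalence of (8.1.27) and (8.1.31) rests on the determinant identity $u'u\,|V'M_uV| = u'M_Vu\,|V'V|$. A minimal numerical check is sketched below; the helper names (`annihilator`, `loglik_8_1_27`, `loglik_8_1_31`) and the random test data are illustrative assumptions only.

```python
import numpy as np

def annihilator(A):
    """Return I - A (A'A)^{-1} A' for a T x k matrix A of full column rank."""
    T = A.shape[0]
    return np.eye(T) - A @ np.linalg.solve(A.T @ A, A.T)

def loglik_8_1_27(u, V):
    """-(T/2) (log u'u + log |V' M_u V|)."""
    T = u.shape[0]
    M_u = annihilator(u.reshape(-1, 1))
    _, logdet = np.linalg.slogdet(V.T @ M_u @ V)
    return -0.5 * T * (np.log(u @ u) + logdet)

def loglik_8_1_31(u, V):
    """-(T/2) (log u' M_V u + log |V'V|)."""
    T = u.shape[0]
    M_V = annihilator(V)
    _, logdet = np.linalg.slogdet(V.T @ V)
    return -0.5 * T * (np.log(u @ M_V @ u) + logdet)

# Arbitrary residual vector u (T = 50) and residual matrix V (50 x 2).
rng = np.random.default_rng(0)
u = rng.normal(size=50)
V = rng.normal(size=(50, 2))
same = np.isclose(loglik_8_1_27(u, V), loglik_8_1_31(u, V))
```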
The estimator $\hat\alpha$ defined in the preceding paragraph is interesting in its own
right. It is the value of $\alpha$ that minimizes

$$(y - f)'[I - M_XY(Y'M_XY)^{-1}Y'M_X](y - f). \tag{8.1.32}$$

Amemiya (1975a) called this estimator the modified nonlinear two-stage
least squares (MNL2S) estimator. The asymptotic covariance matrix of
$\sqrt{T}(\hat\alpha - \alpha)$ is given by

$$V_M = \operatorname{plim}\, T(G'M_VG)^{-1}[\sigma^{*2}G'M_VG + (\sigma^2 - \sigma^{*2})G'P_XG](G'M_VG)^{-1}. \tag{8.1.33}$$

Amemiya (1975a) proved $V_L < V_M < V_B$. It is interesting to note that if $f$ is
linear in $\alpha$ and $Y$, MNL2S is reduced to the usual 2SLS (see Section 7.3.6).
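
The reduction of MNL2S to 2SLS in the linear case can be checked numerically. The sketch below assumes a linear equation $y = Z\alpha + u$ with $Z = (Y, X_1)$, where $X_1$ is a subset of the columns of $X$; the data-generating values and all variable names are hypothetical.

```python
import numpy as np

def proj(A):
    """Projection matrix A (A'A)^{-1} A'."""
    return A @ np.linalg.solve(A.T @ A, A.T)

rng = np.random.default_rng(1)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=(T, 3))])   # all exogenous variables
V = rng.normal(size=(T, 1))
u = 0.8 * V[:, 0] + rng.normal(size=T)                       # u correlated with V
Y = X @ np.array([[1.0], [0.5], [-1.0], [2.0]]) + V          # linear version of (8.1.22)
X1 = X[:, :2]                                                # included exogenous variables
Z = np.column_stack([Y, X1])
y = Z @ np.array([0.7, 1.0, -0.3]) + u

P_X = proj(X)
M_X = np.eye(T) - P_X
A = np.eye(T) - proj(M_X @ Y)       # weighting matrix of (8.1.32)

alpha_mnl2s = np.linalg.solve(Z.T @ A @ Z, Z.T @ A @ y)      # minimizer of (8.1.32)
alpha_2sls = np.linalg.solve(Z.T @ P_X @ Z, Z.T @ P_X @ y)   # usual 2SLS
```

The two coincide because the weighting matrix maps $Z$ into $P_XZ$: it replaces $Y$ by $P_XY$ and leaves $X_1$ unchanged.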
In Sections 8.1.1 and 8.1.3, we discussed four estimators: (1) NL2S (as a
class); (2) BNL2S; (3) MNL2S; (4) NLLI. If we denote NL2S($W = X$) by
SNL2S (the first S stands for standard), we have in the linear case

$$\text{SNL2S} \equiv \text{BNL2S} \equiv \text{MNL2S} \cong \text{NLLI}, \tag{8.1.34}$$

where $\equiv$ means exact identity and $\cong$ means asymptotic equivalence. In the
nonlinear model defined by (8.1.1) and (8.1.22) with the normality
assumption, we can establish the following ranking in terms of the asymptotic
covariance matrix:

$$\text{SNL2S} \ll \text{BNL2S} \ll \text{MNL2S} \ll \text{NLLI}, \tag{8.1.35}$$

where $\ll$ means "is worse than." However, it is important to remember that
the first two estimators are consistent under more general assumptions than
those under which the last two estimators are consistent, as we shall show in
the following simple example.

Consider a very simple case of (8.1.1) and (8.1.22) given by

$$y_t = \alpha z_t^2 + u_t \tag{8.1.36}$$

and

$$z_t = \pi x_t + v_t, \tag{8.1.37}$$

where we assume the vector $(u_t, v_t)$ is i.i.d. with zero mean and a finite
nonsingular covariance matrix. Inserting (8.1.37) into (8.1.36) yields

$$y_t = \alpha\pi^2x_t^2 + \alpha\sigma_v^2 + (u_t + 2\alpha\pi x_tv_t + \alpha v_t^2 - \alpha\sigma_v^2), \tag{8.1.38}$$

where the composite error term (contained within the parentheses) has zero
mean. In this model, SNL2S is 2SLS with $z_t^2$ regressed on $x_t$ in the first stage,
and BNL2S is 2SLS with $z_t^2$ regressed on the constant term and $x_t^2$ in the first
stage. Clearly, both estimators are consistent under general conditions
without further assumptions on $u_t$ and $v_t$. On the other hand, it is not difficult to
show that the consistency of MNL2S and NLLI requires the additional
assumption

$$Ev_t^2\,Ev_t^2u_t = Ev_t^3\,Ev_tu_t, \tag{8.1.39}$$

which is satisfied if $u_t$ and $v_t$ are jointly normal.
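
The consistency of SNL2S and BNL2S in the example (8.1.36)-(8.1.37) without normality can be illustrated by simulation. The sketch below is an assumption-laden illustration: the errors are built from centered exponential draws (skewed, non-normal), $x_t$ is drawn once and then treated as given, and the parameter values and the helper `iv_estimate` are hypothetical.

```python
import numpy as np

def iv_estimate(y, regressor, W):
    """2SLS of y on a single regressor using the columns of W as instruments."""
    fitted = W @ np.linalg.lstsq(W, regressor, rcond=None)[0]   # first-stage fit
    return (fitted @ y) / (fitted @ regressor)

rng = np.random.default_rng(42)
T, alpha, pi = 200_000, 1.5, 1.0
x = rng.uniform(1.0, 3.0, size=T)
e1 = rng.exponential(1.0, size=T) - 1.0     # zero mean, skewed
e2 = rng.exponential(1.0, size=T) - 1.0
v = e1
u = 0.5 * e1 + e2                           # correlated with v, non-normal
z = pi * x + v                              # (8.1.37)
y = alpha * z**2 + u                        # (8.1.36)

# SNL2S: first stage regresses z^2 on x.
alpha_snl2s = iv_estimate(y, z**2, x.reshape(-1, 1))
# BNL2S: first stage regresses z^2 on a constant and x^2.
alpha_bnl2s = iv_estimate(y, z**2, np.column_stack([np.ones(T), x**2]))
```

With a sample this large both estimates should lie close to the value of `alpha` used to generate the data, even though (8.1.39) is not imposed.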


8.2 Estimation in a System of Equations 
8.2.1 Introduction 
Define a system of N nonlinear simultaneous equations by

$$f_{it}(y_t, x_t, \alpha_i) = u_{it}, \qquad i = 1, 2, \ldots, N, \quad t = 1, 2, \ldots, T, \tag{8.2.1}$$

where $y_t$ is an N-vector of endogenous variables, $x_t$ is a vector of exogenous
variables, and $\alpha_i$ is a $K_i$-vector of unknown parameters. We assume that the
N-vector $u_t = (u_{1t}, u_{2t}, \ldots, u_{Nt})'$ is an i.i.d. vector random variable with
zero mean and variance-covariance matrix $\Sigma$. Not all of the elements of
vectors $y_t$ and $x_t$ may actually appear in the arguments of each $f_{it}$. We assume
that each equation has its own vector of parameters $\alpha_i$ and that there are no
constraints among $\alpha_i$'s, but the subsequent results can easily be modified if
each $\alpha_i$ can be parametrically expressed as $\alpha_i(\theta)$, where the number of
elements in $\theta$ is less than $\sum_{i=1}^{N} K_i$.

Strictly speaking, (8.2.1) is not a complete model by itself because there is no
guarantee that a unique solution for $y_t$ exists for every possible value of $u_t$
unless some stringent assumptions are made on the form of $f_{it}$. Therefore we
assume either that $f_{it}$ satisfies such assumptions or that if there is more than
one solution for $y_t$, there is some additional mechanism by which a unique
solution is chosen.*

We shall not discuss the problem of identification in the model (8.2.1).
There are not many useful results in the literature beyond the basic discussion
of Fisher (1966), as summarized by Goldfeld and Quandt (1972, p. 221), and a
recent extension by Brown (1983). Nevertheless, we want to point out that
nonlinearity generally helps rather than hampers identification, so that, for
example, in a nonlinear model the number of excluded exogenous variables in
a given equation need not be greater than or equal to the number of
parameters of the same equation. We should also point out that we have actually given
one sufficient condition for identifiability: that plim $T^{-1}(G_0'P_WG_0)$ in the
conclusion of Theorem 8.1.2 is nonsingular.

Definition of the symbols used in the following sections may facilitate the
discussion.

$\alpha = (\alpha_1', \alpha_2', \ldots, \alpha_N')'$

$\Lambda = \Sigma \otimes I$, where $\otimes$ is the Kronecker product

$f_{it} = f_{it}(y_t, x_t, \alpha_i)$

$f_t$ = an N-vector, the $i$th element of which is $f_{it}$

$f_{(i)}$ = a T-vector, the $t$th element of which is $f_{it}$

$f = (f_{(1)}', f_{(2)}', \ldots, f_{(N)}')'$, an NT-vector

$F = (f_{(1)}, f_{(2)}, \ldots, f_{(N)})$, a $T \times N$ matrix

$g_{it} = \partial f_{it}/\partial\alpha_i$, a $K_i$-vector

$G_i = \partial f_{(i)}/\partial\alpha_i'$, a $T \times K_i$ matrix, the $t$th row of which is $g_{it}'$

$G = \operatorname{diag}(G_1, G_2, \ldots, G_N)$, an $NT \times (\sum_{i=1}^{N} K_i)$ block diagonal matrix


8.2.2 Nonlinear Three-Stage Least Squares Estimator


As a natural extension of the class of the NL2S estimators (the version that
minimizes (8.1.2)), Jorgenson and Laffont (1974) defined the class of
nonlinear three-stage least squares (NL3S) estimators as the value of $\alpha$ that
minimizes

$$f'[\hat\Sigma^{-1} \otimes W(W'W)^{-1}W']f, \tag{8.2.2}$$

where $\hat\Sigma$ is a consistent estimate of $\Sigma$. For example,

$$\hat\Sigma = \frac{1}{T}\sum_{t=1}^{T} f_t(\hat\alpha)f_t(\hat\alpha)', \tag{8.2.3}$$

where $\hat\alpha$ is the NL2S estimator obtained from each equation. This definition of
the NL3S estimators is analogous to the definition of the linear 3SLS as a
generalization of the linear 2SLS. The consistency and the asymptotic
normality of the NL3S estimators defined in (8.2.2) and (8.2.3) have been proved
by Jorgenson and Laffont (1974) and Gallant (1977).
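
The minimand (8.2.2) can be evaluated without forming the $NT \times NT$ Kronecker product, because $f'[\hat\Sigma^{-1} \otimes P_W]f = \operatorname{tr}(\hat\Sigma^{-1}F'P_WF)$ with $P_W = W(W'W)^{-1}W'$. The sketch below compares the two forms; the function names and random inputs are hypothetical.

```python
import numpy as np

def nl3s_objective_kron(F, Sigma_hat, W):
    """Literal evaluation of (8.2.2); F is T x N with columns f_(1), ..., f_(N)."""
    P_W = W @ np.linalg.solve(W.T @ W, W.T)
    f = F.T.reshape(-1)                     # stack the columns of F into an NT-vector
    return f @ np.kron(np.linalg.inv(Sigma_hat), P_W) @ f

def nl3s_objective_trace(F, Sigma_hat, W):
    """Same quantity written as tr(Sigma_hat^{-1} F' P_W F)."""
    PF = W @ np.linalg.lstsq(W, F, rcond=None)[0]     # P_W F
    return np.trace(np.linalg.solve(Sigma_hat, F.T @ PF))

rng = np.random.default_rng(3)
F = rng.normal(size=(20, 2))        # residuals f_it for T = 20, N = 2
W = rng.normal(size=(20, 3))        # instrument matrix
Sigma_hat = F.T @ F / 20            # as in (8.2.3)
q_kron = nl3s_objective_kron(F, Sigma_hat, W)
q_trace = nl3s_objective_trace(F, Sigma_hat, W)
```

The trace form needs only $T \times N$ and $N \times N$ arrays, which matters when $T$ is large.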

The consistency of the NL2S and NL3S estimators of the parameters of
model (8.2.1) can be proved with minimal assumptions on $u_t$, namely,
those stated after (8.2.1). This robustness makes the estimators attractive.
Another important strength of the estimators is that they retain their
consistency regardless of whether or not (8.2.1) yields a unique solution for $y_t$ and, in
the case of multiple solutions, regardless of what additional mechanism
chooses a unique solution. (MaCurdy, 1980, has discussed this point further.)
However, in predicting the future value of the dependent variable, we must
know the mechanism that yields a unique solution.

Amemiya (1977a) defined the class of the NL3S estimators more generally
as the value of $\alpha$ that minimizes

$$f'\hat\Lambda^{-1}S(S'\hat\Lambda^{-1}S)^{-1}S'\hat\Lambda^{-1}f, \tag{8.2.4}$$

where $\hat\Lambda$ is a consistent estimate of $\Lambda$ and $S$ is a matrix of constants with NT
rows and with the rank of at least $\sum_{i=1}^{N} K_i$. This definition is reduced to the
Jorgenson-Laffont definition if $S = \operatorname{diag}(W, W, \ldots, W)$. The asymptotic
variance-covariance matrix of $\sqrt{T}$ times the estimator is given by

$$V_3 = \operatorname{plim}\, T[G'\Lambda^{-1}S(S'\Lambda^{-1}S)^{-1}S'\Lambda^{-1}G]^{-1}. \tag{8.2.5}$$

Its lower bound is equal to

$$V_{B3} = \lim T[EG'\Lambda^{-1}EG]^{-1}, \tag{8.2.6}$$

which is attained when we choose $S = EG$. We call this estimator the BNL3S
estimator (B for "best").

We can also attain the lower bound (8.2.6) using the Jorgenson-Laffont
definition, but that is possible if and only if the space spanned by the column
vectors of $W$ contains the union of the spaces spanned by the column vectors
of $EG_i$ for $i = 1, 2, \ldots, N$. This necessitates including many columns in $W$,
which is likely to increase the finite sample variance of the estimator although
it has no effect asymptotically. This is a disadvantage of the Jorgenson-Laffont
definition compared to the Amemiya definition.

Noting that BNL3S is not practical, as was the case with BNL2S, Amemiya
(1976a) suggested the following approximation:

Step 1. Compute $\hat\alpha_i$, an SNL2S estimator of $\alpha_i$, $i = 1, 2, \ldots, N$.

Step 2. Evaluate $G_i$ at $\hat\alpha_i$; call it $\hat G_i$.

Step 3. Treat $\hat G_i$ as the dependent variables of a regression and search for the
optimal set of independent variables $W_i$ that best predict $\hat G_i$.

Step 4. Choose $S = \operatorname{diag}(P_1\hat G_1, P_2\hat G_2, \ldots, P_N\hat G_N)$, where
$P_i = W_i(W_i'W_i)^{-1}W_i'$.

Applications of NL3S can be found in articles by Jorgenson and Lau (1975), 
who estimated a three-equation translog expenditure model; by Jorgenson 
and Lau (1978), who estimated a two-equation translog expenditure model; 
and by Haessel (1976), who estimated a system of demand equations, nonlin- 
ear only in parameters, by both NL2S and NL3S estimators. 


8.2.3 Nonlinear Full Information Maximum Likelihood Estimator


In this subsection we shall consider the maximum likelihood estimator of
model (8.2.1) under the normality assumption of $u_{it}$. To do so we must
assume that (8.2.1) defines a one-to-one correspondence between $y_t$ and $u_t$.
This assumption enables us to write down the likelihood function in the usual
way as the product of the density of $u_t$ and the Jacobian. Unfortunately, this is
a rather stringent assumption, which considerably limits the usefulness of the
nonlinear full information maximum likelihood (NLFI) estimator in
practice. There are two types of problems: (1) There may be no solution for $y$ for
some values of $u$. (2) There may be more than one solution for $y$ for some
values of $u$. In the first case the domain of $u$ must be restricted, a condition that
implies that the normality assumption cannot hold. In the second case we
must specify a mechanism by which a unique solution is chosen, a condition
that would complicate the likelihood function. We should note that the NL2S
and NL3S estimators are free from both of these problems.

Assuming $u_t \sim N(0, \Sigma)$, we can write the log likelihood function of model
(8.2.1) as

$$L^{*} = -\frac{T}{2}\log|\Sigma| + \sum_{t=1}^{T}\log\left\|\frac{\partial f_t}{\partial y_t'}\right\| - \frac{1}{2}\sum_{t=1}^{T} f_t'\Sigma^{-1}f_t. \tag{8.2.7}$$

Solving $\partial L^{*}/\partial\Sigma = 0$ for $\Sigma$, we get

$$\Sigma = \frac{1}{T}\sum_{t=1}^{T} f_tf_t'. \tag{8.2.8}$$

Inserting (8.2.8) into (8.2.7) yields the concentrated log likelihood function

$$L = \sum_{t=1}^{T}\log\left\|\frac{\partial f_t}{\partial y_t'}\right\| - \frac{T}{2}\log\left|\frac{1}{T}\sum_{t=1}^{T} f_tf_t'\right|. \tag{8.2.9}$$

The NLFI maximum likelihood estimator of $\alpha$ is defined as the value of $\alpha$ that
maximizes (8.2.9).
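
The concentration step from (8.2.7) to (8.2.9) can be verified numerically: evaluated at the $\Sigma$ of (8.2.8), the quadratic term of (8.2.7) equals the constant $NT/2$, which (8.2.9) drops. The sketch below assumes the residuals are stored as a $T \times N$ array and that the log absolute Jacobian determinants are supplied; both function names are hypothetical.

```python
import numpy as np

def nlfi_loglik(F, logabsdet_J, Sigma):
    """(8.2.7): F is T x N with rows f_t'; logabsdet_J[t] = log ||df_t/dy_t'||."""
    T = F.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.trace(np.linalg.solve(Sigma, F.T @ F))    # sum_t f_t' Sigma^{-1} f_t
    return -0.5 * T * logdet + logabsdet_J.sum() - 0.5 * quad

def nlfi_concentrated(F, logabsdet_J):
    """(8.2.9): Sigma replaced by (1/T) sum_t f_t f_t'."""
    T = F.shape[0]
    _, logdet = np.linalg.slogdet(F.T @ F / T)
    return logabsdet_J.sum() - 0.5 * T * logdet

rng = np.random.default_rng(5)
T, N = 30, 2
F = rng.normal(size=(T, N))          # stand-in residuals
log_J = rng.normal(size=T)           # stand-in log Jacobian terms
Sigma_hat = F.T @ F / T              # (8.2.8)
gap = nlfi_concentrated(F, log_J) - nlfi_loglik(F, log_J, Sigma_hat)
```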

Amemiya (1977a) proved that if the true distribution of $u_t$ is normal, NLFI
is consistent, asymptotically normal, and in general has a smaller asymptotic
covariance matrix than BNL3S. It is well known that in the linear model the
full information maximum likelihood estimator has the same asymptotic
distribution as the three-stage least squares estimator. Amemiya showed that
the asymptotic equivalence occurs if and only if $f_{it}$ can be written in the form

$$f_i(y_t, x_t, \alpha_i) = A_i(\alpha_i)'z(y_t, x_t) + B_i(\alpha_i, x_t), \tag{8.2.10}$$

where $z$ is an N-vector of surrogate variables.

If the true distribution of $u_t$ is not normal, on the other hand, Amemiya
proved that NLFI is generally not consistent, whereas NL3S is known to be
consistent even then. This, again, is contrary to the linear case in which the full
information maximum likelihood estimator obtained under the assumption
of normality is consistent even if the true distribution is not normal. Note that
this result is completely separate from and in no way contradicts the quite
likely fact that the maximum likelihood estimator of a nonlinear model
derived under the assumption of a certain regular nonnormal distribution is
consistent if the true distribution is the same as the assumed distribution.

We shall see how the consistency of NLFI crucially depends on the
normality assumption. Differentiating (8.2.9) with respect to $\alpha_i$, we obtain

$$\frac{\partial L}{\partial\alpha_i} = \sum_{t=1}^{T}\frac{\partial g_{it}}{\partial u_{it}} - T\sum_{t=1}^{T} g_{it}f_t'\left(\sum_{t=1}^{T} f_tf_t'\right)^{-1}_{i}, \tag{8.2.11}$$

where $(\;)^{-1}_{i}$ denotes the $i$th column of the inverse of the matrix within the
parentheses. The consistency of NLFI is equivalent to the condition

$$\lim E\,\frac{1}{T}\frac{\partial L}{\partial\alpha_i} = 0 \tag{8.2.12}$$

and hence to the condition

$$\lim \frac{1}{T}\sum_{t=1}^{T} E\frac{\partial g_{it}}{\partial u_{it}} = \lim \frac{1}{T}\sum_{t=1}^{T} Eg_{it}u_t'\sigma^{i}, \tag{8.2.13}$$

where $\sigma^{i}$ is the $i$th column of $\Sigma^{-1}$. Now (8.2.13) could hold even if each term of
a summation is different from the corresponding term of the other, but that
event is extremely unlikely. Therefore we can say that the consistency of NLFI
is essentially equivalent to the condition

$$E\frac{\partial g_{it}}{\partial u_{it}} = Eg_{it}u_t'\sigma^{i}. \tag{8.2.14}$$

It is interesting to note that the condition (8.2.14) holds if $u_t$ is normal
because of the following lemma.

LEMMA. Suppose $u = (u_1, u_2, \ldots, u_N)'$ is distributed as $N(0, \Sigma)$, where $\Sigma$
is positive definite. If $\partial h(u)/\partial u_i$ is continuous, $E|\partial h/\partial u_i| < \infty$, and
$E|hu_j| < \infty$, then

$$E\frac{\partial h}{\partial u_i} = Ehu'\sigma^{i}, \tag{8.2.15}$$

where $\sigma^{i}$ is the $i$th column of $\Sigma^{-1}$.
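
The lemma can be checked numerically in the scalar case $N = 1$, where $\sigma^{i}$ reduces to $1/\sigma^2$ and (8.2.15) reads $E\,h'(u) = E\,h(u)u/\sigma^2$. The sketch below uses Gauss-Hermite quadrature for the normal expectations; the test function $h$, the value of `sigma`, and the helper `normal_expectation` are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def normal_expectation(func, sigma, degree=40):
    """E func(u) for u ~ N(0, sigma^2), by Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(degree)     # weight function exp(-x^2 / 2)
    return np.sum(weights * func(sigma * nodes)) / np.sqrt(2.0 * np.pi)

sigma = 1.3
h = lambda u: u**3 + np.sin(u)              # arbitrary smooth function
dh = lambda u: 3 * u**2 + np.cos(u)         # its derivative

lhs = normal_expectation(dh, sigma)                               # E dh/du
rhs = normal_expectation(lambda u: h(u) * u, sigma) / sigma**2    # E h u / sigma^2
```

For a non-normal $u$ the two sides generally differ, which is the source of the inconsistency of NLFI discussed in the text.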


In simple models the condition (8.2.14) may hold without normality. In the
model defined by (8.1.36) and (8.1.37), we have $g_{1t} = -z_t^2$ and $g_{2t} = -x_t$.
Therefore (8.2.14) clearly holds for $i = 2$ for any distribution of $u_t$ provided
that the mean is 0. The equation for $i = 1$ gives (8.1.39), which is satisfied by a
class of distributions including the normal distribution. (Phillips, 1982, has
presented another simple example.) However, if $g_{it}$ is a more complicated
nonlinear function of the exogenous variables and the parameters $\{\alpha_i\}$ as well
as of $u_t$, (8.2.14) can be made to hold only when we specify a density that
depends on the exogenous variables and the parameters of the model. In such
a case, normality can be regarded, for all practical purposes, as a necessary and
sufficient condition for the consistency of NLFI.

It is interesting to compare certain iterative formulae for calculating NLFI
and BNL3S. By equating the right-hand side of (8.2.11) to 0 and rearranging
terms, we can obtain the following iteration to obtain NLFI:

$$\hat\alpha_{(2)} = \hat\alpha_{(1)} - (\tilde G'\hat\Lambda^{-1}G)^{-1}\tilde G'\hat\Lambda^{-1}f, \tag{8.2.16}$$

where

$$\tilde G_i = G_i - \frac{1}{T}F\sum_{t=1}^{T}\frac{\partial g_{it}'}{\partial u_t} \tag{8.2.17}$$

and $\tilde G = \operatorname{diag}(\tilde G_1, \tilde G_2, \ldots, \tilde G_N)$ and all the variables that appear in the
second term of the right-hand side of (8.2.16) are evaluated at $\hat\alpha_{(1)}$.

The Gauss-Newton iteration for BNL3S is defined by

$$\hat\alpha_{(2)} = \hat\alpha_{(1)} - (\bar G'\hat\Lambda^{-1}G)^{-1}\bar G'\hat\Lambda^{-1}f, \tag{8.2.18}$$

where $\bar G_i = EG_i$ and $\bar G = \operatorname{diag}(\bar G_1, \bar G_2, \ldots, \bar G_N)$ as before.

Thus we see that the only difference between (8.2.16) and (8.2.18) is in the
respective "instrumental variables" used in the formulae. Note that $\tilde G_i$ defined
in (8.2.17) can work as a proper set of "instrumental variables" (that is,
variables uncorrelated with $u_t$) only if $u_t$ satisfies the condition of the
aforementioned lemma, whereas $\bar G_i$ is always a proper set of instrumental variables,
a fact that implies that BNL3S is more robust than NLFI. If $u_t$ is normal,
however, $\tilde G_i$ contains more of the part of $G_i$ uncorrelated with $u_t$ than $\bar G_i$ does,
which implies that NLFI is more efficient than BNL3S under normality.

Note that (8.2.16) is a generalization of the formula (7.2.12) for the linear
case. Unlike the iteration of the linear case, however, the iteration defined by
(8.2.16) does not have the property that $\hat\alpha_{(2)}$ is asymptotically equivalent to
NLFI when $\hat\alpha_{(1)}$ is consistent. Therefore its main value may be pedagogical,
and it may not be useful in practice.


8.3 Tests of Hypotheses, Prediction, and Computation 
8.3.1 Tests of Hypotheses 


Suppose we want to test a hypothesis of the form $h(\alpha) = 0$ in model (8.1.1),
where $h$ is a $q$-vector of nonlinear functions. Because we have not specified the
distribution of $Y_t$, we could not use the three tests defined in (4.5.3), (4.5.4),
and (4.5.5) even if we assumed the normality of $u$. But we can use two test
statistics: (1) the generalized Wald test statistic analogous to (4.5.21), and (2)
the difference between the constrained and the unconstrained sums of
squared residuals (denoted SSRD). Let $\hat\alpha$ and $\tilde\alpha$ be the solutions of the
unconstrained and the constrained minimization of (8.1.2), respectively. Then the
generalized Wald statistic is given by

$$\text{Wald} = \frac{T}{S_T(\hat\alpha)}\,h(\hat\alpha)'[\hat H(\hat G'P_W\hat G)^{-1}\hat H']^{-1}h(\hat\alpha), \tag{8.3.1}$$

where $\hat H = [\partial h/\partial\alpha']_{\hat\alpha}$ and $\hat G = [\partial f/\partial\alpha']_{\hat\alpha}$, and the SSRD statistic is given by

$$\text{SSRD} = \frac{T}{S_T(\hat\alpha)}\,[S_T(\tilde\alpha) - S_T(\hat\alpha)]. \tag{8.3.2}$$

Both test statistics are asymptotically distributed as chi-square with $q$ degrees
of freedom.

Gallant and Jorgenson (1979) derived the asymptotic distribution (a
noncentral chi-square) of the two test statistics under the assumption that $\alpha$
deviates from the hypothesized constraints in the order of $T^{-1/2}$.

As an application of the SSRD test, Gallant and Jorgenson tested the
hypothesis of homogeneity of degree zero of an equation for durables in the
two-equation translog expenditure model of Jorgenson and Lau (1978).

The Wald and SSRD tests can be straightforwardly extended to the system
of equations (8.2.1) by using NL3S in lieu of NL2S. As an application of the
SSRD test using NL3S, Gallant and Jorgenson tested the hypothesis of
symmetry of the matrix of parameters in the three-equation translog expenditure
model of Jorgenson and Lau (1975).

If we assume (8.1.22) in addition to (8.1.1) and assume normality, the
model is specified (although it is a limited information model); therefore the
three tests of Section 4.5.1 can be used with NLLI. The same is true of NLFI in
model (8.2.1) under normality. The asymptotic results of Gallant and Holly
(1980) given in Section 4.5.1 are also applicable if we replace $\sigma^2(G'G)^{-1}$ in the
right-hand side of (4.5.26) by the asymptotic covariance matrix of the NLFI
estimator.


8.3.2 Prediction 


Bianchi and Calzolari (1980) proposed a method by which we can calculate
the mean squared prediction error matrix of a vector predictor based on any
estimator of the nonlinear simultaneous equations model. Suppose the
structural equations can be written as $f(y_p, x_p, \alpha) = u_p$ at the prediction period $p$
and we can solve for $y_p$ as $y_p = g(x_p, \alpha, u_p)$. Define the predictor $\hat y_p$ based on
the estimator $\hat\alpha$ by $\hat y_p = g(x_p, \hat\alpha, 0)$. (Note that $y_p$ is an N-vector.) We call this
the deterministic predictor. Then we have

$$\begin{aligned}
E(y_p - \hat y_p)(y_p - \hat y_p)' &= E[g(x_p, \alpha, u_p) - g(x_p, \alpha, 0)][g(x_p, \alpha, u_p) - g(x_p, \alpha, 0)]' \\
&\quad + E[g(x_p, \alpha, 0) - g(x_p, \hat\alpha, 0)][g(x_p, \alpha, 0) - g(x_p, \hat\alpha, 0)]' \\
&\equiv A_1 + A_2.
\end{aligned} \tag{8.3.3}$$

Bianchi and Calzolari suggested that $A_1$ be evaluated by simulation. As for $A_2$,
we can easily obtain its asymptotic value from the knowledge of the
asymptotic distribution of $\hat\alpha$.

Mariano and Brown (1983) and Brown and Mariano (1982) compared the
deterministic predictor $\hat y_p$ defined in the preceding paragraph with two other
predictors called the Monte Carlo predictor $\tilde y_p$ and the residual-based
predictor $\bar y_p$ defined as

$$\tilde y_p = \frac{1}{S}\sum_{s=1}^{S} g(x_p, \hat\alpha, v_s), \tag{8.3.4}$$

where $\{v_s\}$ are i.i.d. with the same distribution as $u_p$, and

$$\bar y_p = \frac{1}{T}\sum_{t=1}^{T} g(x_p, \hat\alpha, \hat u_t), \tag{8.3.5}$$

where $\hat u_t = f(y_t, x_t, \hat\alpha)$.

Because $y_p - \hat y_p = (y_p - Ey_p) + (Ey_p - \hat y_p)$ for any predictor $\hat y_p$, we should
compare predictors on the basis of how well $Ey_p = Eg(x_p, \alpha, u_p)$ is estimated
by each predictor. Moreover, because $\hat\alpha$ is common for the three predictors,
we can essentially consider the situation where $\hat\alpha$ in the predictor is replaced by
the parameter $\alpha$. Thus the authors' problem is essentially equivalent to that of
comparing the following three estimators of $Eg(u_p)$:

Deterministic: $g(0)$

Monte Carlo: $\frac{1}{S}\sum_{s=1}^{S} g(v_s)$

Residual-based: $\frac{1}{T}\sum_{t=1}^{T} g(u_t)$

Clearly, the deterministic predictor is the worst, as Mariano and Brown (1983)
concluded. According to their other article (Brown and Mariano, 1982), the
choice between the Monte Carlo and residual-based predictors depends on the
consideration that the former can be more efficient if $S$ is large and the
assumed distribution of $u_p$ is true, whereas the latter is simpler to compute and
more robust in the sense that the distribution of $u_t$ need not be specified.
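
The comparison of the three estimators of $Eg(u_p)$ can be illustrated with a scalar example. The sketch below assumes $g(u) = e^{u}$ and $u \sim N(0, \sigma^2)$, for which $Eg(u) = e^{\sigma^2/2}$; the draws used for the residual-based estimator stand in for true errors, in line with the text's simplification of replacing $\hat\alpha$ by $\alpha$. All names and numerical settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
sigma, S, T = 1.0, 10_000, 2_000
g = np.exp
true_mean = np.exp(sigma**2 / 2)            # E exp(u) for u ~ N(0, sigma^2)

deterministic = g(0.0)                                        # g(0)
monte_carlo = g(rng.normal(0.0, sigma, size=S)).mean()        # (1/S) sum_s g(v_s)
residual_based = g(rng.normal(0.0, sigma, size=T)).mean()     # (1/T) sum_t g(u_t)

errors = {
    "deterministic": abs(deterministic - true_mean),
    "monte_carlo": abs(monte_carlo - true_mean),
    "residual_based": abs(residual_based - true_mean),
}
```

Because $g$ is nonlinear, $g(0)$ misses $Eg(u)$ by a fixed amount that does not shrink with the sample, while the two averaging estimators converge to it.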


8.3.3 Computation 


The discussion of the computation of NLFI preceded the theoretical
discussion of the statistical properties of NLFI by more than ten years. The first
article on computation was by Eisenpress and Greenstadt (1966), who
proposed a modified Newton-Raphson iteration. Their modification combined
both (4.4.4) and (4.4.5). Chow (1973) differed from Eisenpress and Greenstadt
in that he obtained simpler formulae by assuming that different parameters
appear in different equations, as in (8.2.1). We have already mentioned the
algorithm considered by Amemiya (1977a), mainly for a pedagogical purpose.
Dagenais (1978) modified this algorithm to speed up the convergence and
compared it with a Newton-Raphson method proposed by Chow and Fair
(1973) and with the DFP algorithm mentioned in Section 4.4.1 in certain
examples of nonlinear models. The results are inconclusive. Belsley (1979)
compared the computational speed of the DFP algorithm in computing NLFI
and NL3S in five models of various degrees of complexity and found that
NL3S was three to ten times faster. Nevertheless, Belsley showed that the
computation of NLFI is quite feasible and can be improved by using a more
suitable algorithm and by using the approximation of the Jacobian proposed
by Fair; see Eq. (8.3.6).

Fair and Parke (1980) estimated Fair's (1976) macro model (97 equations,
29 of which are stochastic, with 182 parameters including 12 first-order
autoregressive coefficients), which is nonlinear in variables as well as in parameters
(this latter nonlinearity is caused by the transformation to take account of the
first-order autoregression of the errors), by OLS, SNL2S, the Jorgenson-Laffont
NL3S, and NLFI. The latter two estimators are calculated by a
derivative-free algorithm proposed by Parke (1982).

Parke noted that the largest model for which NLFI and NL3S had been
calculated before Parke's study was the one Belsley calculated, a model that
contained 19 equations and 61 unknown coefficients. Parke also noted that
Newton's method is best for linear models and that the DFP method is
preferred for small nonlinear models; however, Parke's method is the only
feasible one for large nonlinear models.

Fair and Parke used the approximation of the Jacobian

$$\sum_{t=1}^{T}\log|J_t| \cong \frac{T}{n}\left(\log|J_{t_1}| + \log|J_{t_2}| + \cdots + \log|J_{t_n}|\right), \tag{8.3.6}$$

where $J_t = \partial f_t/\partial y_t'$, $n$ is a small integer, and $t_1, t_2, \ldots, t_n$ are equally spaced
between 1 and T.
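
A minimal sketch of the approximation (8.3.6) follows. It assumes the Jacobians are available as a sequence of square arrays and picks the $n$ periods by rounding an equally spaced grid; the function name and the rounding rule are illustrative choices, not taken from Fair and Parke.

```python
import numpy as np

def approx_sum_logdet(jacobians, n):
    """Approximate sum_t log|J_t| by (T/n) times the sum over n equally spaced periods."""
    T = len(jacobians)
    idx = np.linspace(0, T - 1, n).round().astype(int)    # zero-based periods t_1, ..., t_n
    logdets = [np.linalg.slogdet(jacobians[t])[1] for t in idx]   # log |det J_t|
    return T / n * sum(logdets)

# If the Jacobian does not vary over time, the approximation is exact.
constant_J = [2.0 * np.eye(2) for _ in range(10)]
approx = approx_sum_logdet(constant_J, n=3)
exact = sum(np.linalg.slogdet(J)[1] for J in constant_J)
```

The saving is that only $n$ determinants are computed per likelihood evaluation instead of $T$.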

The hypothesis (8.2.14) can be tested by Hausman’s test (Section 4.5.1) 
using either NLFI versus NL3S or NLFI versus NL2S. By this test, Fair found 
little difference among the estimators. Fair also found that in terms of predic- 
tive accuracy there is not much difference among the different estimators, but, 
in terms of policy response, OLS is set apart from the rest. 

Hatanaka (1978) considered a simultaneous equations model nonlinear
only in variables. Such a model can be written as $F(Y, X)\Gamma + XB = U$. Define
$\hat Y$ by $F(\hat Y, X)\hat\Gamma + X\hat B = 0$, where $\hat\Gamma$ and $\hat B$ are the OLS estimates. Then
Hatanaka proposed using $F(\hat Y, X)$ as the instruments to calculate 3SLS.
He proposed the method-of-scoring iteration to calculate NLFI, where the
iteration is started at the aforementioned 3SLS.


Exercises 


1. (Section 8.1.2)
Define what you consider to be the best estimators of $\alpha$ and $\beta$ in the model
$(y_t + \alpha)^2 = \beta x_t + u_t$, $t = 1, 2, \ldots, T$, where $\{x_t\}$ are known constants
and $\{u_t\}$ are i.i.d. with $Eu_t = 0$ and $Vu_t = \sigma^2$. Justify your choice of
estimators.


2. (Section 8.1.2)
In model (8.1.12) show that the minimization of $\sum_{t=1}^{T}[z_t(\lambda) - x_t'\beta]^2$ with
respect to $\lambda$ and $\beta$ yields inconsistent estimates.


3. (Section 8.1.3)
Prove the consistency of the estimator of $\alpha$ obtained by minimizing
$(y - f)'[I - V(V'V)^{-1}V'](y - f)$ and derive its asymptotic
variance-covariance matrix. Show that the matrix is smaller (in the matrix sense)
than $V_M$ given in (8.1.33).


4. (Section 8.1.3)
In the model defined by (8.1.36) and (8.1.37), consider the following
two-stage estimation method: In the first stage, regress $z_t$ on $x_t$ and define
$\hat z_t = \hat\pi x_t$, where $\hat\pi$ is the least squares estimator; in the second stage, regress
$y_t$ on $\hat z_t^2$ to obtain the least squares estimator of $\alpha$. Show that the resulting
estimator of $\alpha$ is inconsistent. (This method may be regarded as an
application of Theil's interpretation, Section 7.3.6, to a nonlinear model.)


5. (Section 8.1.3)
In the model defined by (8.1.36) and (8.1.37), show that the consistency of
MNL2S and NLLI requires (8.1.39).


6. (Section 8.1.3)
In the model defined by (8.1.36) and (8.1.37), assume $\pi = 1$, $x_t = 1$ for all $t$,
$Vu_t = Vv_t = 1$, and $\operatorname{Cov}(u_t, v_t) = c$. Evaluate the asymptotic variances,
denoted $V_1$, $V_2$, $V_3$, and $V_4$, of the SNL2S, BNL2S, MNL2S, and NLLI
estimators of $\alpha$ and show $V_1 \geq V_2 \geq V_3 \geq V_4$ for every $|c| < 1$.


7. (Section 8.2.3)
Consider the following two-equation model (Goldfeld and Quandt, 1968):

$$\log y_{1t} = \gamma_1\log y_{2t} + \beta_1 + \beta_2x_t + u_{1t},$$
$$y_{2t} = \gamma_2y_{1t} + \beta_3x_t + u_{2t},$$

where $\gamma_1 > 0$ and $\gamma_2 < 0$. Show that there are two solutions of $y_{1t}$ and $y_{2t}$ for
a given value of $(u_{1t}, u_{2t})$. Show also that $(u_{1t}, u_{2t})$ cannot be normally
distributed.


8. (Section 8.2.3)
Consider the following model (Phillips, 1982):

$$\log y_{1t} + \alpha_1x_t = u_{1t},$$
$$y_{2t} + \alpha_2y_{1t} = u_{2t}.$$

Derive the conditions on $u_{1t}$ and $u_{2t}$ that make NLFI consistent. Use
(8.2.14).


9 Qualitative Response Models 


9.1 Introduction 


Qualitative response models (henceforth to be abbreviated as QR models) are 
regression models in which dependent (or endogenous) variables take discrete 
values. These models have numerous applications in economics because 
many behavioral responses are qualitative in nature: A consumer decides 
whether to buy a car or not; a commuter chooses a particular mode of trans- 
portation from several available ones; a worker decides whether to take a job 
offer or not; and so on. A long list of empirical examples of QR models can be 
found in my recent survey (Amemiya, 1981). 

Qualitative response models, also known as quantal, categorical, or discrete 
models, have been used in biometric applications longer than they have been 
used in economics. Biometricians use the models to study, for example, the 
effect of an insecticide on the survival or death of an insect, or the effect of a 
drug on a patient. The kind of QR model used by biometricians is usually the 
simplest kind — univariate binary (or dichotomous) dependent variable (sur- 
vival or death) and a single independent variable (dosage). 

Economists (and sociologists to a certain extent), on the other hand, must 
deal with more complex models, such as models in which a single dependent 
variable takes more than two discrete values (multinomial models) or models 
that involve more than one discrete dependent variable (multivariate models), 
as well as considering a larger number of independent variables. The estima- 
tion of the parameters of these complex models requires more elaborate 
techniques, many of which have been recently developed by econometricians. 

This chapter begins with a discussion of the simplest model — the model for 
a univariate binary dependent variable (Section 9.2), and then moves on to 
multinomial and multivariate models (Sections 9.3 and 9.4). The emphasis 
here is on the theory of estimation (and hypothesis testing to a lesser extent) 
and therefore complementary to Amemiya's survey mentioned earlier, which 
discussed many empirical examples and contained only fundamental results 
on the theory of statistical inference. We shall also discuss important topics
omitted by the survey: choice-based sampling (Section 9.5), distribution-free
methods (Section 9.6), and panel data QR models (Section 9.7).


9.2 Univariate Binary Models 
9.2.1 Model Specification 
A univariate binary QR model is defined by 
$$P(y_i = 1) = F(x_i'\beta_0), \qquad i = 1, 2, \ldots, n, \tag{9.2.1}$$


where $\{y_i\}$ is a sequence of independent binary random variables taking the
value 1 or 0, $x_i$ is a $K$-vector of known constants, $\beta_0$ is a $K$-vector of unknown
parameters, and $F$ is a certain known function.

It would be more general to specify the probability as $F(x_i, \beta_0)$, but the
specification (9.2.1) is most common. As in the linear regression model,
specifying the argument of $F$ as $x_i'\beta_0$ is more general than it seems because the
elements of $x_i$ can be transformations of the original independent variables.
To the extent we can approximate a general nonlinear function of the original
independent variables by $x_i'\beta_0$, the choice of $F$ is not critical as long as it is a
distribution function. An arbitrary distribution function can be attained by
choosing an appropriate function $H$ in the specification $F[H(x_i, \beta_0)]$.

The functional forms of F most frequently used in application are the 
following: 


Linear Probability Model

$$F(x) = x.$$

Probit Model

$$F(x) = \Phi(x) \equiv \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp(-t^2/2)\, dt.$$

Logit Model

$$F(x) = \Lambda(x) \equiv \frac{e^x}{1 + e^x}.$$


The linear probability model has an obvious defect in that $F$ for this model
is not a proper distribution function as it is not constrained to lie between 0
and 1. This defect can be corrected by defining $F = 1$ if $F(x_i'\beta_0) > 1$ and $F = 0$
if $F(x_i'\beta_0) < 0$, but the procedure produces unrealistic kinks at the truncation
points. Nevertheless, it has frequently been used in econometric applications,
especially during and before the 1960s, because of its computational
simplicity.

The probit model, like many other statistical models using the normal
distribution, may be justified by appealing to a central limit theorem. A major
justification for the logit model, although there are other justifications
(notably, its connection with discriminant analysis, which we shall discuss in
Section 9.2.8), is that the logistic distribution function $\Lambda$ is similar to a normal
distribution function but has a much simpler form. The logistic distribution
has zero mean and variance equal to $\pi^2/3$. The standardized logistic
distribution $e^{\lambda x}/(1 + e^{\lambda x})$ with $\lambda = \pi/\sqrt{3}$ has slightly heavier tails than the
standard normal distribution.
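
The three functional forms can be compared numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative choices, not notation from the text. It evaluates the upper tail probabilities of the standard normal and of the standardized logistic distribution (scaled by $\lambda = \pi/\sqrt{3}$) so that the heavier far tails of the latter can be seen directly.

```python
import math

def linear_probability(x):
    """F(x) = x; not a proper distribution function because it can leave [0, 1]."""
    return x

def probit(x):
    """Standard normal distribution function Phi(x), computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logit(x):
    """Logistic distribution function Lambda(x) = e^x / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))

def standardized_logit(x):
    """Logistic distribution function rescaled to unit variance (lambda = pi / sqrt(3))."""
    lam = math.pi / math.sqrt(3.0)
    return logit(lam * x)

if __name__ == "__main__":
    # Compare upper tail probabilities at a few illustrative points.
    for x in (0.0, 1.0, 2.0, 3.0, 4.0):
        tail_normal = 1.0 - probit(x)
        tail_logistic = 1.0 - standardized_logit(x)
        print(f"x={x:3.1f}  1-Phi={tail_normal:.5f}  1-Lambda_std={tail_logistic:.5f}")
```

Near the center the two curves are close, and the difference in the tails becomes visible only a few standard deviations out.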

We shall consider two examples, one biometric and one econometric, of the
model (9.2.1) to gain some insight into the problem of specifying the
probability function.


EXAMPLE 9.2.1. Suppose that a dosage $x_i$ (actually the logarithm of dosage is
used as $x_i$ in most studies) of an insecticide is given to the $i$th insect and we
want to study how the probability of the $i$th insect dying is related to the dosage
$x_i$. (In practice, individual insects are not identified, and a certain dosage $x_t$ is
given to each of $n_t$ insects in group $t$. However, the present analysis is easier to
understand if we proceed as if each insect could be identified.) To formulate
this model, it is useful to assume that each insect possesses its own tolerance
against a particular insecticide and dies when the dosage level exceeds the
tolerance. Suppose that the tolerance $y_i^*$ of the $i$th insect is an independent
drawing from a distribution identical for all insects. Moreover, if the tolerance
is a result of many independent and individually inconsequential additive
factors, we can reasonably assume $y_i^* \sim N(\mu, \sigma^2)$ because of the central limit
theorem. Defining $y_i = 1$ if the $i$th insect dies and $y_i = 0$ otherwise, we have


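$$P(y_i = 1) = P(y_i^* < x_i) = \Phi[(x_i - \mu)/\sigma], \tag{9.2.2}$$


giving rise to a probit model where $\beta_1 = -\mu/\sigma$ and $\beta_2 = 1/\sigma$. If, on the other
hand, we assume that $y_i^*$ has a logistic distribution with mean $\mu$ and variance
$\sigma^2$, we get a logit model

$$P(y_i = 1) = \Lambda\left[\frac{\pi(x_i - \mu)}{\sqrt{3}\,\sigma}\right]. \tag{9.2.3}$$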
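
The tolerance interpretation in Example 9.2.1 can be checked by simulation. The sketch below is an illustration under assumed values: the parameter values, sample size, and function names are hypothetical and do not come from the text. It draws normal tolerances, records the fraction of insects whose tolerance falls below a given dosage, and compares that fraction with the probit probability $\Phi(\beta_1 + \beta_2 x)$, where $\beta_1 = -\mu/\sigma$ and $\beta_2 = 1/\sigma$.

```python
import math
import random

def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def simulate_death_rate(dosage, mu, sigma, n_insects, rng):
    """Fraction of insects whose tolerance y* ~ N(mu, sigma^2) falls below the dosage."""
    deaths = sum(1 for _ in range(n_insects) if rng.gauss(mu, sigma) < dosage)
    return deaths / n_insects

def probit_probability(dosage, mu, sigma):
    """P(y = 1) = Phi(beta1 + beta2 * dosage) with beta1 = -mu/sigma, beta2 = 1/sigma."""
    beta1, beta2 = -mu / sigma, 1.0 / sigma
    return normal_cdf(beta1 + beta2 * dosage)

if __name__ == "__main__":
    rng = random.Random(0)
    mu, sigma = 2.0, 0.5  # hypothetical tolerance distribution
    for dosage in (1.5, 2.0, 2.5):
        freq = simulate_death_rate(dosage, mu, sigma, 20000, rng)
        prob = probit_probability(dosage, mu, sigma)
        print(f"dosage={dosage}  simulated={freq:.3f}  probit={prob:.3f}")
```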


EXAMPLE 9.2.2 (Domencich and McFadden, 1975). Let us consider the
decision of a person regarding whether he or she drives a car or travels by
transit to work. We assume that the utility associated with each mode of
transport is a function of the mode characteristics $z$ (mainly the time and the
cost incurred by the use of the mode) and the individual's socioeconomic
characteristics $w$, plus an additive error term $\epsilon$. We define $U_{i1}$ and $U_{i0}$ as the $i$th
person's indirect utilities associated with driving a car and traveling by transit,
respectively. Then, assuming a linear function, we have

$$U_{i0} = \alpha_0 + z_{i0}'\beta + w_i'\gamma_0 + \epsilon_{i0} \tag{9.2.4}$$

and

$$U_{i1} = \alpha_1 + z_{i1}'\beta + w_i'\gamma_1 + \epsilon_{i1}. \tag{9.2.5}$$

The basic assumption is that the $i$th person drives a car if $U_{i1} > U_{i0}$ and travels
by transit if $U_{i1} < U_{i0}$. (There is indecision if $U_{i1} = U_{i0}$, but this happens with
zero probability if $\epsilon_{i1}$ and $\epsilon_{i0}$ are continuous random variables.) Thus, defining
$y_i = 1$ if the $i$th person drives a car, we have

$$\begin{aligned}
P(y_i = 1) &= P(U_{i1} > U_{i0}) \\
&= P[\epsilon_{i0} - \epsilon_{i1} < \alpha_1 - \alpha_0 + (z_{i1} - z_{i0})'\beta + w_i'(\gamma_1 - \gamma_0)] \\
&= F[\alpha_1 - \alpha_0 + (z_{i1} - z_{i0})'\beta + w_i'(\gamma_1 - \gamma_0)],
\end{aligned} \tag{9.2.6}$$


where $F$ is the distribution function of $\epsilon_{i0} - \epsilon_{i1}$. Thus a probit (logit) model
arises from assuming the normal (logistic) distribution for $\epsilon_{i0} - \epsilon_{i1}$.


9.2.2 Consistency and Asymptotic Normality of the Maximum Likelihood 
Estimator 


To derive the asymptotic properties of the MLE, we make the following 
assumptions in the model (9.2.1). 


ASSUMPTION 9.2.1. $F$ has derivative $f$ and second-order derivative $f'$, and
$0 < F(x) < 1$ and $f(x) > 0$ for every $x$.


ASSUMPTION 9.2.2. The parameter space $B$ is an open bounded subset of the
Euclidean $K$-space.


ASSUMPTION 9.2.3. $\{x_i\}$ are uniformly bounded in $i$ and
$\lim_{n\to\infty} n^{-1}\sum_{i=1}^n x_i x_i'$ is a finite nonsingular matrix. Furthermore, the
empirical distribution of $\{x_i\}$ converges to a distribution function.


Both probit and logit models satisfy Assumption 9.2.1. The boundedness of
$\{x_i\}$ is somewhat restrictive and could be removed. However, we assume this
to make the proofs of consistency and asymptotic normality simpler. In this
way we hope that the reader will understand the essentials without being
hindered by too many technical details.

The logarithm of the likelihood function of the model (9.2.1) is given by

$$\log L = \sum_{i=1}^n y_i \log F(x_i'\beta) + \sum_{i=1}^n (1 - y_i)\log[1 - F(x_i'\beta)]. \tag{9.2.7}$$

Therefore the MLE $\hat\beta$ is a solution (if it exists) of

$$\frac{\partial \log L}{\partial \beta} = \sum_{i=1}^n \frac{y_i - F_i}{F_i(1 - F_i)} f_i x_i = 0, \tag{9.2.8}$$

where $F_i = F(x_i'\beta)$ and $f_i = f(x_i'\beta)$.

To prove the consistency of $\hat\beta$, we must verify the assumptions of Theorem
4.1.2. Assumptions A and B are clearly satisfied. To verify C, we use Theorem
4.2.2. If we define $g_i(y, \beta) = [y - F(x_i'\beta_0)]\log F(x_i'\beta)$, $g_i(y, \beta)$ in a compact
neighborhood of $\beta_0$ satisfies all the conditions for $g_t(y, \theta)$ in the theorem
because of Assumptions 9.2.1 and 9.2.3. Furthermore $\lim_{n\to\infty} n^{-1}\sum_{i=1}^n F_{i0}\log F_i$
exists because of Theorem 4.2.3. Therefore $n^{-1}\sum_{i=1}^n y_i \log F_i$ converges
to $\lim_{n\to\infty} n^{-1}\sum_{i=1}^n F_{i0}\log F_i$, where $F_{i0} = F(x_i'\beta_0)$, in probability uniformly in
$\beta \in N(\beta_0)$, an open neighborhood of $\beta_0$. A similar result can be obtained for
the second term of the right-hand side of (9.2.7). Therefore

$$Q(\beta) \equiv \operatorname{plim}\, n^{-1}\log L = \lim n^{-1}\sum_{i=1}^n F_{i0}\log F_i + \lim n^{-1}\sum_{i=1}^n (1 - F_{i0})\log(1 - F_i), \tag{9.2.9}$$

where the convergence is uniform in $\beta \in N(\beta_0)$. Because our assumptions
enable us to differentiate inside the limit operation in (9.2.9),¹ we obtain


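$$\frac{\partial Q}{\partial \beta} = \lim n^{-1}\sum_{i=1}^n \frac{F_{i0}}{F_i} f_i x_i - \lim n^{-1}\sum_{i=1}^n \frac{1 - F_{i0}}{1 - F_i} f_i x_i, \tag{9.2.10}$$

which vanishes at $\beta = \beta_0$. Furthermore, because

$$\frac{\partial^2 Q}{\partial\beta\,\partial\beta'}\bigg|_{\beta_0} = -\lim n^{-1}\sum_{i=1}^n \frac{f_{i0}^2}{F_{i0}(1 - F_{i0})}\, x_i x_i', \tag{9.2.11}$$

where $f_{i0} = f(x_i'\beta_0)$, which is negative definite by our assumptions, $Q$ attains
a strict local maximum at $\beta = \beta_0$. Thus assumption C of Theorem 4.1.2 holds,
and hence the consistency of $\hat\beta$ has been proved.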

A solution of (9.2.8) may not always exist (see Albert and Anderson, 1984).
For example, suppose that $\{x_i\}$ are scalars such that $x_i < 0$ for $i \le c$ for some
integer $c$ between 1 and $n$ and $x_i > 0$ for $i > c$ and that $y_i = 0$ for $i \le c$ and
$y_i = 1$ for $i > c$. Then $\log L$ does not attain a maximum for any finite value of
$\beta$. If $\{x_i\}$ are $K$-vectors, the same situation arises if $y = 0$ for the values of $x_i$
lying within one side of a hyperplane in $R^K$ and $y = 1$ for the values of $x_i$ lying
within the other side. However, the possibility of no solution of (9.2.8) does
not affect the consistency proof because the probability of no solution
approaches 0 as $n$ goes to $\infty$, as shown in Theorem 4.1.2.
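
The scalar example of nonexistence can be illustrated numerically. The sketch below assumes a logit model with a single regressor and no intercept; the data set and the function name are hypothetical. With perfectly separated data the log likelihood (9.2.7) keeps increasing toward 0 as $\beta$ grows, so no finite maximizer exists.

```python
import math

def logit_loglik(beta, x, y):
    """Log likelihood (9.2.7) of a logit model with one scalar regressor and no intercept."""
    total = 0.0
    for xi, yi in zip(x, y):
        z = beta * xi
        # log Lambda(z), written so that exp() never receives a large positive argument
        if z >= 0:
            log_p = -math.log1p(math.exp(-z))
        else:
            log_p = z - math.log1p(math.exp(z))
        log_q = log_p - z  # log[1 - Lambda(z)]
        total += yi * log_p + (1 - yi) * log_q
    return total

if __name__ == "__main__":
    # Hypothetical, perfectly separated sample: y = 0 whenever x < 0, y = 1 whenever x > 0.
    x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
    y = [0, 0, 0, 1, 1, 1]
    for beta in (0.5, 1.0, 2.0, 5.0, 10.0):
        print(f"beta={beta:5.1f}  log L={logit_loglik(beta, x, y):.6f}")
```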

Next we shall prove asymptotic normality using Theorem 4.2.4, which 
means that assumptions A, B, and C of Theorem 4.1.3 and Eq. (4.2.22) need to 
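be verified. Differentiating (9.2.8) with respect to $\beta$ yields

$$\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'} = -\sum_{i=1}^n \left[\frac{y_i - F_i}{F_i(1 - F_i)}\right]^2 f_i^2\, x_i x_i' + \sum_{i=1}^n \left[\frac{y_i - F_i}{F_i(1 - F_i)}\right] f_i'\, x_i x_i'. \tag{9.2.12}$$

Thus assumption A of Theorem 4.1.3 is clearly satisfied. We can verify that
Assumptions 9.2.1 and 9.2.3 imply assumption B of Theorem 4.1.3 by using
Theorems 4.1.5 and 4.2.2. Next, we have from (9.2.8)

$$\frac{1}{\sqrt{n}}\frac{\partial \log L}{\partial\beta}\bigg|_{\beta_0} = \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{y_i - F_{i0}}{F_{i0}(1 - F_{i0})} f_{i0}\, x_i. \tag{9.2.13}$$

The asymptotic normality of (9.2.13) readily follows from Theorem 3.3.5
(Liapounov CLT) because each term in the summation is independent with
bounded second and third absolute moments because of Assumptions 9.2.1,
9.2.2, and 9.2.3. Thus assumption C of Theorem 4.1.3 is satisfied and we
obtain

$$\frac{1}{\sqrt{n}}\frac{\partial \log L}{\partial\beta}\bigg|_{\beta_0} \to N(0, A), \tag{9.2.14}$$

where

$$A = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n \frac{f_{i0}^2}{F_{i0}(1 - F_{i0})}\, x_i x_i'. \tag{9.2.15}$$

Finally, taking the expectation of (9.2.12) at $\beta_0$, we obtain

$$\lim_{n\to\infty} \frac{1}{n} E\,\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'}\bigg|_{\beta_0} = -A, \tag{9.2.16}$$

verifying (4.2.22). Therefore

$$\sqrt{n}(\hat\beta - \beta_0) \to N(0, A^{-1}). \tag{9.2.17}$$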


For a proof of consistency and asymptotic normality under more general 
assumptions in the logit model, see the article by Gourieroux and Monfort 
(1981). 


9.2.3 Global Concavity of the Likelihood Function in the Logit and Probit 
Models 


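Global concavity means that $\partial^2 \log L/\partial\beta\,\partial\beta'$ is a negative definite matrix for
$\beta \in B$. Because we have by a Taylor expansion

$$\log L(\beta) = \log L(\hat\beta) + \frac{\partial \log L}{\partial\beta'}\bigg|_{\hat\beta}(\beta - \hat\beta) + \frac{1}{2}(\beta - \hat\beta)'\,\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'}\bigg|_{\beta^*}(\beta - \hat\beta), \tag{9.2.18}$$

where $\beta^*$ lies between $\beta$ and $\hat\beta$, global concavity implies that $\log L(\hat\beta) >
\log L(\beta)$ for $\beta \neq \hat\beta$ if $\hat\beta$ is a solution of (9.2.8). We shall prove global concavity
for logit and probit models.

For the logit model we have

$$\frac{\partial\Lambda}{\partial x} = \Lambda(1 - \Lambda) \quad\text{and}\quad \frac{\partial^2\Lambda}{\partial x^2} = (1 - 2\Lambda)\frac{\partial\Lambda}{\partial x}. \tag{9.2.19}$$

Inserting (9.2.19) into (9.2.12) with $F = \Lambda$ yields

$$\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'} = -\sum_{i=1}^n \Lambda_i(1 - \Lambda_i)\, x_i x_i', \tag{9.2.20}$$

where $\Lambda_i = \Lambda(x_i'\beta)$. Thus the global concavity follows from Assumption 9.2.3.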
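
The logit Hessian (9.2.20) is easy to evaluate, and its negative definiteness can be checked at arbitrary parameter values. The following sketch assumes NumPy is available; the function names and the test design are illustrative, not part of the text. Note that the Hessian does not depend on $y$.

```python
import numpy as np

def logit_hessian(beta, X):
    """Hessian (9.2.20) of the logit log likelihood: -sum_i L_i (1 - L_i) x_i x_i'."""
    lam = 1.0 / (1.0 + np.exp(-X @ beta))
    weights = lam * (1.0 - lam)
    return -(X * weights[:, None]).T @ X

def is_negative_definite(H):
    """True if every eigenvalue of the symmetric matrix H is strictly negative."""
    return bool(np.all(np.linalg.eigvalsh(H) < 0))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical design matrix with an intercept and two regressors.
    X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
    for beta in (np.zeros(3), np.array([1.0, -2.0, 0.5])):
        print(beta, is_negative_definite(logit_hessian(beta, X)))
```

The check succeeds whenever the columns of the design matrix are linearly independent, which is the finite-sample counterpart of the nonsingularity required in Assumption 9.2.3.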

A proof of global concavity for the probit model is a little more complicated.
Putting $F_i = \Phi_i$, $f_i = \phi_i$, and $f_i' = -x_i'\beta\,\phi_i$, where $\phi$ is the density function of
$N(0, 1)$, into (9.2.12) yields

$$\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'} = -\sum_{i=1}^n \phi_i \Phi_i^{-2}(1 - \Phi_i)^{-2}\left[(y_i - 2y_i\Phi_i + \Phi_i^2)\phi_i + (y_i - \Phi_i)\Phi_i(1 - \Phi_i)\, x_i'\beta\right] x_i x_i'. \tag{9.2.21}$$

Thus we need to show the positivity of

$$g_y(x) \equiv (y - 2y\Phi + \Phi^2)\phi + (y - \Phi)\Phi(1 - \Phi)x$$

for $y = 1$ and 0. First, consider the case $y = 1$. Because $g_1(x) =
(1 - \Phi)^2(\phi + \Phi x)$, we need to show $\phi + \Phi x > 0$. The inequality is clearly
satisfied if $x \ge 0$, so assume $x < 0$. But this is equivalent to showing

$$\phi > (1 - \Phi)x \quad\text{for } x > 0, \tag{9.2.22}$$

which follows from the identity (see Feller, 1961, p. 166)

$$x^{-1}\exp(-x^2/2) = \int_x^\infty (1 + y^{-2})\exp(-y^2/2)\, dy. \tag{9.2.23}$$

Next, if $y = 0$, we have $g_0(x) = \Phi^2[\phi - (1 - \Phi)x]$, which is clearly positive if
$x \le 0$ and is positive if $x > 0$ because of (9.2.22). Thus we proved global
concavity for the probit case.
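
The positivity of $g_y(x)$ and the inequality (9.2.22) can also be checked numerically. The sketch below uses only the standard library; the function names and grid are illustrative choices. The grid is limited to $[-5, 5]$ because, farther out, $1 - \Phi$ computed in floating point loses too many digits for the sign check to be meaningful.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def g(y, x):
    """g_y(x) = (y - 2y*Phi + Phi^2)*phi + (y - Phi)*Phi*(1 - Phi)*x for y in {0, 1}."""
    P = Phi(x)
    return (y - 2.0 * y * P + P * P) * phi(x) + (y - P) * P * (1.0 - P) * x

def min_g_on_grid(lo=-5.0, hi=5.0, steps=1000):
    """Smallest value of g_0 and g_1 over an evenly spaced grid on [lo, hi]."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return min(g(y, x) for y in (0, 1) for x in grid)

if __name__ == "__main__":
    print("minimum of g_y(x) on the grid:", min_g_on_grid())
```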


9.2.4 Iterative Methods for Obtaining the Maximum Likelihood Estimator 


The iterative methods we discussed in Section 4.4 can be used to calculate a 
root of Eq. (9.2.8). For the logit and probit models, iteration is simple because 
of the global concavity proved in the preceding section. Here we shall only 
discuss the method-of-scoring iteration and give it an interesting interpreta- 
tion. 

As we noted in Section 4.4, the method-of-scoring iteration is defined by

$$\hat\beta_2 = \hat\beta_1 - \left[E\,\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'}\bigg|_{\hat\beta_1}\right]^{-1} \frac{\partial \log L}{\partial\beta}\bigg|_{\hat\beta_1}, \tag{9.2.24}$$

where $\hat\beta_1$ is an initial estimator of $\beta_0$ and $\hat\beta_2$ is the second-round estimator. The
iteration is to be repeated until a sequence of estimators thus obtained
converges. Using (9.2.8) and (9.2.12), we can write (9.2.24) as

$$\hat\beta_2 = \left[\sum_{i=1}^n \frac{\hat f_i^2}{\hat F_i(1 - \hat F_i)}\, x_i x_i'\right]^{-1} \sum_{i=1}^n \frac{\hat f_i}{\hat F_i(1 - \hat F_i)}\, x_i\left[y_i - \hat F_i + \hat f_i\, x_i'\hat\beta_1\right], \tag{9.2.25}$$

where we have defined $\hat F_i = F(x_i'\hat\beta_1)$ and $\hat f_i = f(x_i'\hat\beta_1)$.

An interesting interpretation of the iteration (9.2.25) is possible. From
(9.2.1) we obtain

$$y_i = F(x_i'\beta_0) + u_i, \tag{9.2.26}$$

where $Eu_i = 0$ and $Vu_i = F(x_i'\beta_0)[1 - F(x_i'\beta_0)]$. This is a heteroscedastic
nonlinear regression model. Expanding $F(x_i'\beta_0)$ in a Taylor series around $\beta_0 = \hat\beta_1$
and rearranging terms, we obtain

$$y_i - \hat F_i + \hat f_i\, x_i'\hat\beta_1 \cong \hat f_i\, x_i'\beta_0 + u_i. \tag{9.2.27}$$

Thus $\hat\beta_2$ defined in (9.2.25) can be interpreted as the weighted least squares
(WLS) estimator of $\beta_0$ applied to (9.2.27) with $Vu_i$ estimated by $\hat F_i(1 - \hat F_i)$.
For this reason the method-of-scoring iteration in the QR model is sometimes
referred to as the nonlinear weighted least squares (NLWLS) iteration
(Walker and Duncan, 1967).
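
The WLS interpretation of the scoring iteration translates directly into code. The following is a minimal sketch assuming NumPy and, for concreteness, the logit choice of $F$; the function names, starting value, convergence rule, and simulated data are all illustrative assumptions. Each step regresses the left-hand side of (9.2.27) on $\hat f_i x_i$ with weights $[\hat F_i(1 - \hat F_i)]^{-1}$, which reproduces (9.2.25).

```python
import numpy as np

def logistic(z):
    """Logistic distribution function Lambda(z)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_density(z):
    """Derivative of Lambda: Lambda(z) * (1 - Lambda(z))."""
    lam = logistic(z)
    return lam * (1.0 - lam)

def scoring_step(beta, X, y, F=logistic, f=logistic_density):
    """One method-of-scoring step (9.2.25), computed as WLS on the regression (9.2.27)."""
    z = X @ beta
    F_hat, f_hat = F(z), f(z)
    weights = 1.0 / (F_hat * (1.0 - F_hat))     # inverse of the estimated V u_i
    regressors = f_hat[:, None] * X              # rows are f_i x_i'
    working_y = y - F_hat + f_hat * z            # y_i - F_i + f_i x_i' beta_1
    lhs = regressors.T @ (regressors * weights[:, None])
    rhs = regressors.T @ (weights * working_y)
    return np.linalg.solve(lhs, rhs)

def fit_by_scoring(X, y, tol=1e-10, max_iter=100):
    """Repeat the scoring step from beta = 0 until successive estimates agree."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        new_beta = scoring_step(beta, X, y)
        if np.max(np.abs(new_beta - beta)) < tol:
            return new_beta
        beta = new_beta
    raise RuntimeError("scoring iteration did not converge")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    true_beta = np.array([0.5, -1.0])            # hypothetical values for the illustration
    y = (rng.random(n) < logistic(X @ true_beta)).astype(float)
    print("estimate:", fit_by_scoring(X, y))
```

For the probit model one would pass the normal distribution function and density as `F` and `f`; the step itself is unchanged.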


9.2.5 Berkson's Minimum Chi-Square Method 


There are many variations of the minimum chi-square (MIN $\chi^2$) method, one
of which is Berkson's method. For example, the feasible generalized least
squares (FGLS) estimator defined in Section 6.2 is a MIN $\chi^2$ estimator.
Another example is the Barankin-Gurland estimator mentioned in Section 4.2.4.
A common feature of these estimators is that the minimand evaluated at the
estimator is asymptotically distributed as chi-square, from which the name is
derived.

The MIN $\chi^2$ method in the context of the QR model was first proposed by
Berkson (1944) for the logit model but can be used for any QR model. It is
useful only when there are many observations on the dependent variable $y$
having the same value of the vector of independent variables $x$ (sometimes
referred to as "many observations per cell") so that $F(x'\beta_0)$ for the specific
value of $x$ can be accurately estimated by the relative frequency of $y$ being
equal to 1.

To explain the method in detail, we need to define new symbols. Suppose
that the vector $x_i$ takes $T$ distinct vector values $x_{(1)}, x_{(2)}, \ldots, x_{(T)}$, and
classify integers $(1, 2, \ldots, n)$ into $T$ disjoint sets $I_1, I_2, \ldots, I_T$ by the rule:
$i \in I_t$ if $x_i = x_{(t)}$. Define $P_{(t)} = P(y_i = 1)$ if $i \in I_t$. In addition, define $n_t$ as the
number of integers contained in $I_t$, $r_t = \sum_{i \in I_t} y_i$, and $\hat P_{(t)} = r_t/n_t$. Note that
$\{\hat P_{(t)}\}$ constitute the sufficient statistics of the model. In the following
discussion we shall write $x_{(t)}$, $P_{(t)}$, and $\hat P_{(t)}$ as $x_t$, $P_t$, and $\hat P_t$ if there is no ambiguity.

From (9.2.1) we have 


$$P_t = F(x_t'\beta_0), \qquad t = 1, 2, \ldots, T. \tag{9.2.28}$$


If $F$ is one-to-one (which is implied by Assumption 9.2.1), we can invert the
relationship in (9.2.28) to obtain

$$F^{-1}(P_t) = x_t'\beta_0, \tag{9.2.29}$$

where $F^{-1}$ denotes the inverse function of $F$. Expanding $F^{-1}(\hat P_t)$ in a Taylor
series around $P_t$ (which is possible under Assumption 9.2.1), we obtain


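$$\begin{aligned}
F^{-1}(\hat P_t) &= x_t'\beta_0 + \frac{\partial F^{-1}}{\partial P_t}\bigg|_{P_t^*}(\hat P_t - P_t) \\
&= x_t'\beta_0 + \frac{1}{f[F^{-1}(P_t^*)]}(\hat P_t - P_t) \\
&\equiv x_t'\beta_0 + v_t + w_t,
\end{aligned} \tag{9.2.30}$$

where $P_t^*$ lies between $\hat P_t$ and $P_t$,

$$v_t = \frac{1}{f[F^{-1}(P_t)]}(\hat P_t - P_t),$$

and

$$w_t = \left(\frac{1}{f[F^{-1}(P_t^*)]} - \frac{1}{f[F^{-1}(P_t)]}\right)(\hat P_t - P_t).$$

The fact that $v_t$ and $w_t$ depend on $n$ has been suppressed.

Because

$$Vv_t \equiv \sigma_t^2 = \frac{P_t(1 - P_t)}{n_t f^2[F^{-1}(P_t)]} \tag{9.2.31}$$

and because $w_t$ is $O(n_t^{-1})$ and hence can be ignored for large $n_t$ (as we shall
show rigorously later), (9.2.30) approximately defines a heteroscedastic linear
regression model. The MIN $\chi^2$ estimator, denoted $\tilde\beta$, is defined as the WLS
estimator applied to (9.2.30) ignoring $w_t$. We can estimate $\sigma_t^2$ by $\hat\sigma_t^2$ obtained
by substituting $\hat P_t$ for $P_t$ in (9.2.31). Thus

$$\tilde\beta = \left(\sum_{t=1}^T \hat\sigma_t^{-2} x_t x_t'\right)^{-1} \sum_{t=1}^T \hat\sigma_t^{-2} x_t F^{-1}(\hat P_t). \tag{9.2.32}$$

We shall prove the consistency and asymptotic normality of $\tilde\beta$ (as $n$ goes to $\infty$
with $T$ fixed) under Assumptions 9.2.1, 9.2.3, and the following additional
assumption:


ASSUMPTION 9.2.4. $\lim_{n\to\infty}(n_t/n) = c_t \neq 0$ for every $t = 1, 2, \ldots, T$,
where $T$ is a fixed integer.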


We make Assumption 9.2.4 to simplify the analysis. However, if $c_t = 0$ for
some $t$, we can act as if the observations corresponding to that $t$ did not exist
and impose Assumptions 9.2.3 and 9.2.4 on the remaining observations.

Inserting (9.2.30) into (9.2.32) and rearranging terms yield


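$$\sqrt{n}(\tilde\beta - \beta_0) = \left(\frac{1}{n}\sum_{t=1}^T \hat\sigma_t^{-2} x_t x_t'\right)^{-1} \frac{1}{\sqrt{n}}\sum_{t=1}^T \hat\sigma_t^{-2} x_t (v_t + w_t). \tag{9.2.33}$$

Because $T$ is fixed, we have by Theorem 3.2.6

$$\operatorname{plim}_{n\to\infty} \frac{1}{n}\sum_{t=1}^T \hat\sigma_t^{-2} x_t x_t' = \sum_{t=1}^T \bar\sigma_t^{-2} x_t x_t', \tag{9.2.34}$$

where $\bar\sigma_t^{-2} = c_t f^2[F^{-1}(P_t)]/[P_t(1 - P_t)]$. Also, using Theorem 3.2.7, we
obtain

$$\begin{aligned}
\frac{1}{\sqrt{n}}\sum_{t=1}^T \hat\sigma_t^{-2} x_t (v_t + w_t) &= \sum_{t=1}^T \frac{1}{\sqrt{n}}\,\hat\sigma_t^{-1}\cdot \hat\sigma_t^{-1}\sigma_t\, x_t\,(\sigma_t^{-1} v_t + \sigma_t^{-1} w_t) \\
&\overset{\mathrm{LD}}{=} \sum_{t=1}^T \bar\sigma_t^{-1} x_t\, \sigma_t^{-1} v_t,
\end{aligned} \tag{9.2.35}$$

because $\operatorname{plim} n^{-1/2}\hat\sigma_t^{-1} = \bar\sigma_t^{-1}$, $\operatorname{plim} \hat\sigma_t^{-1}\sigma_t = 1$, and $\operatorname{plim} \sigma_t^{-1} w_t = 0$. But,
because $\{v_t\}$ are independent, the vector $(\sigma_1^{-1}v_1, \sigma_2^{-1}v_2, \ldots, \sigma_T^{-1}v_T)$ converges
to $N(0, I_T)$. Therefore

$$\sum_{t=1}^T \bar\sigma_t^{-1} x_t\, \sigma_t^{-1} v_t \to N\left(0, \sum_{t=1}^T \bar\sigma_t^{-2} x_t x_t'\right). \tag{9.2.36}$$

Finally, we obtain from (9.2.33) through (9.2.36) and Theorem 3.2.7

$$\sqrt{n}(\tilde\beta - \beta_0) \to N\left[0, \left(\sum_{t=1}^T \bar\sigma_t^{-2} x_t x_t'\right)^{-1}\right]. \tag{9.2.37}$$

Because $A$ defined in (9.2.15) is equal to $\sum_{t=1}^T \bar\sigma_t^{-2} x_t x_t'$ under Assumption
9.2.4, (9.2.17) and (9.2.37) show that the MIN $\chi^2$ estimator has the same
asymptotic distribution as the MLE.² The MIN $\chi^2$ estimator is simpler to
compute than the MLE because the former is explicitly defined as in (9.2.32),
whereas the latter requires iteration. However, the MIN $\chi^2$ method requires a
large number of observations in each cell. If the model contains several
independent variables, this may be impossible to obtain. In the next subsection we
shall compare the two estimators in greater detail.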
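
For the logit model the MIN $\chi^2$ estimator (9.2.32) takes a particularly simple form, because $f[F^{-1}(P)] = P(1 - P)$ and therefore $\hat\sigma_t^{-2} = n_t \hat P_t(1 - \hat P_t)$. The sketch below assumes NumPy; the function name, argument layout, and the grouped data in the demonstration are hypothetical. It returns the estimate together with $(\sum_t \hat\sigma_t^{-2} x_t x_t')^{-1}$ as an estimate of its covariance matrix.

```python
import numpy as np

def min_chi2_logit(X_cells, r, n):
    """Berkson's MIN chi-square estimator (9.2.32) for the logit model.

    X_cells: (T, K) array whose rows are the distinct regressor vectors x_t
    r:       (T,) number of responses y = 1 in each cell
    n:       (T,) number of observations in each cell
    """
    X_cells = np.asarray(X_cells, dtype=float)
    r = np.asarray(r, dtype=float)
    n = np.asarray(n, dtype=float)
    P_hat = r / n
    if np.any((P_hat <= 0.0) | (P_hat >= 1.0)):
        raise ValueError("logit transformation undefined when a cell frequency is 0 or 1")
    logit_P = np.log(P_hat / (1.0 - P_hat))      # F^{-1}(P_hat_t) for F = Lambda
    inv_var = n * P_hat * (1.0 - P_hat)           # sigma_hat_t^{-2} for the logit model
    lhs = X_cells.T @ (X_cells * inv_var[:, None])
    rhs = X_cells.T @ (inv_var * logit_P)
    beta = np.linalg.solve(lhs, rhs)
    cov = np.linalg.inv(lhs)                      # estimated covariance matrix of beta
    return beta, cov

if __name__ == "__main__":
    # Hypothetical grouped data: four dosage levels with 40 subjects each.
    X_cells = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    r = np.array([6, 13, 25, 33])
    n = np.array([40, 40, 40, 40])
    beta, cov = min_chi2_logit(X_cells, r, n)
    print("estimate:", beta)
    print("standard errors:", np.sqrt(np.diag(cov)))
```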

In the probit model, $F^{-1}(\hat P_t) = \Phi^{-1}(\hat P_t)$. Although $\Phi^{-1}$ does not have an
explicit form, it can be easily evaluated numerically. The function $\Phi^{-1}(\cdot)$ is
called the probit transformation. In the logit model we can explicitly write

$$\Lambda^{-1}(\hat P_t) = \log[\hat P_t/(1 - \hat P_t)], \tag{9.2.38}$$

which is called the logit transformation or the logarithm of the odds ratio.
Cox (1970, p. 33) mentioned the following modification of (9.2.38):

$$\Lambda_c^{-1}(\hat P_t) = \log\{[\hat P_t + (2n_t)^{-1}]/[1 - \hat P_t + (2n_t)^{-1}]\}. \tag{9.2.39}$$

This modification has two advantages over (9.2.38): (1) The transformation
(9.2.39) can always be defined whereas (9.2.38) cannot be defined if $\hat P_t = 0$ or
1. (Nevertheless, it is not advisable to use Cox's modification when $n_t$ is small.)
(2) It can be shown that $E\Lambda_c^{-1}(\hat P_t) - \Lambda^{-1}(P_t)$ is of the order of $n_t^{-2}$, whereas
$E\Lambda^{-1}(\hat P_t) - \Lambda^{-1}(P_t)$ is of the order of $n_t^{-1}$.
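
Both transformations, and the bias of Cox's version, can be computed directly. The following sketch uses only the standard library (`math.comb` requires Python 3.8 or later); the function names are illustrative. The bias is obtained by summing exactly over the binomial distribution of $r_t$, which is feasible for moderate $n_t$.

```python
import math

def logit_transform(r, n):
    """Logit transformation (9.2.38) of the relative frequency r/n."""
    p = r / n
    return math.log(p / (1.0 - p))  # raises an exception if p is 0 or 1

def cox_logit_transform(r, n):
    """Cox's modification (9.2.39): add (2n)^{-1} to numerator and denominator."""
    p = r / n
    c = 1.0 / (2.0 * n)
    return math.log((p + c) / (1.0 - p + c))

def cox_bias(n, P):
    """Exact E[cox_logit_transform(r, n)] - log(P / (1 - P)) for r ~ Binomial(n, P)."""
    expected = sum(
        math.comb(n, r) * P**r * (1.0 - P) ** (n - r) * cox_logit_transform(r, n)
        for r in range(n + 1)
    )
    return expected - math.log(P / (1.0 - P))

if __name__ == "__main__":
    print("Cox transform at r = 0, n = 10:", cox_logit_transform(0, 10))
    for n in (25, 50, 100):
        print(f"n={n:4d}  bias of Cox transform at P=0.3: {cox_bias(n, 0.3):.6f}")
```

The exact expectation of the unmodified transformation (9.2.38) does not exist, since $r_t = 0$ and $r_t = n_t$ occur with positive probability; this is the first advantage of the modification noted above.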

In the preceding passages we have proved the consistency and the
asymptotic normality of Berkson's MIN $\chi^2$ estimator assuming $x_i = x_t$ for $i \in I_t$.
However, a situation may occur in practice where a researcher must proceed
as if $x_i = x_t$ for $i \in I_t$ even if $x_i \neq x_t$ because individual observations $x_i$ are not
available and only their group mean $\bar x_t = n_t^{-1}\sum_{i \in I_t} x_i$ is available. (Such an
example will be given in Example 9.3.1.) In this case the MIN $\chi^2$ estimator is
generally inconsistent. McFadden and Reid (1975) addressed this issue and,
under the assumption of normality of $x_i$, evaluated the asymptotic bias of the
MIN $\chi^2$ estimator and proposed a modification of the MIN $\chi^2$ estimator that
is consistent.


9.2.6 Comparison of the Maximum Likelihood Estimator and the Minimum 
Chi-Square Estimator 


In a simple model where the vector $x_t$ consists of 1 and a single independent
variable and where $T$ is small, the exact mean and variance of MLE and the
MIN $\chi^2$ estimator can be computed by a direct method. Berkson (1955, 1957)
did so for the logit and the probit model, respectively, and found the exact
mean squared error of the MIN $\chi^2$ estimator to be smaller in all the examples
considered.

Amemiya (1980b) obtained the formulae for the bias to the order of $n^{-1}$ and
the mean squared error to the order of $n^{-2}$ of MLE and MIN $\chi^2$ in a general
logit model.³ The method employed in this study is as follows: Using (9.2.19)
and the sampling scheme described in Section 9.2.5, the normal equation
(9.2.8) is reduced to

$$\sum_{t=1}^T n_t[\hat P_t - \Lambda(x_t'\beta)]\, x_t = 0. \tag{9.2.40}$$

We can regard (9.2.40) as defining the MLE $\hat\beta$ implicitly as a function of
$\hat P_1, \hat P_2, \ldots, \hat P_T$, say, $\hat\beta = g(\hat P_1, \hat P_2, \ldots, \hat P_T)$. Expanding $g$ in a Taylor
series around $P_1, P_2, \ldots, P_T$ and noting that $g(P_1, P_2, \ldots, P_T) = \beta_0$, we
obtain

$$\hat\beta - \beta_0 \cong \sum_t g_t u_t + \frac{1}{2}\sum_t\sum_s g_{ts} u_t u_s + \frac{1}{6}\sum_t\sum_s\sum_r g_{tsr} u_t u_s u_r, \tag{9.2.41}$$

where $u_t = \hat P_t - P_t$ and $g_t$, $g_{ts}$, and $g_{tsr}$ denote the first-, second-, and third-order
partial derivatives of $g$ evaluated at $(P_1, P_2, \ldots, P_T)$, respectively.
The bias of the MLE to the order of $n^{-1}$ is obtained by taking the expectation
of the first two terms of the right-hand side of (9.2.41). The mean squared error
of the MLE to the order of $n^{-2}$ is obtained by calculating the mean squared
error of the right-hand side of (9.2.41), ignoring the terms of a smaller order
than $n^{-2}$. We need not consider higher terms in the Taylor expansion because
$Eu_t^k$ for $k \ge 5$ are at most of the order of $n_t^{-3}$. A Taylor expansion for the
MIN $\chi^2$ estimator $\tilde\beta$ is obtained by expanding the right-hand side of (9.2.32)
around $P_t$.

Using these formulae, Amemiya calculated the approximate mean squared
errors of MLE and the MIN $\chi^2$ estimator in several examples, both artificial
and empirical, and found the MIN $\chi^2$ estimator to have a smaller mean
squared error than MLE in all the examples considered. However, the
difference between the two mean squared error matrices can be shown to be neither
positive definite nor negative definite (Ghosh and Sinha, 1981). In fact, Davis
(1984) showed examples in which the MLE has a smaller mean squared error
to the order of $n^{-2}$ and offered an intuitive argument that showed that the
greater $T$, the more likely MLE is to have a smaller mean squared error.

Amemiya also derived the formulae for the $n^{-2}$-order mean squared errors
of the bias-corrected MLE and the bias-corrected MIN $\chi^2$ estimator and
showed that the former is smaller. The bias-corrected MLE is defined as
$\hat\beta - B(\hat\beta)$, where $B$ is the bias to the order of $n^{-1}$, and similarly for MIN $\chi^2$.
This result is consistent with the second-order efficiency of MLE in the
exponential family proved by Ghosh and Subramanyam (1974), as mentioned in
Section 4.2.4. The actual magnitude of the difference of the $n^{-2}$-order mean
squared errors of the bias-corrected MLE and MIN $\chi^2$ in Amemiya's examples
was always found to be extremely small. Davis did not report the
corresponding results for her examples.

Smith, Savin, and Robertson (1984) conducted a Monte Carlo study of a
logit model with one independent variable and found that although in point
estimation MIN $\chi^2$ did better than MLE, as in the studies of Berkson and
Amemiya, the convergence of the distribution of the MIN $\chi^2$ estimator to a
normal distribution was sometimes disturbingly slow, being unsatisfactory in
one instance even at $n = 480$.

For further discussion of this topic, see the article by Berkson (1980) and the 
comments following the article. 


9.2.7 Tests of Hypotheses 


To test a hypothesis on a single parameter, we can perform a standard normal 
test using the asymptotic normality of either MLE or the MIN $\chi^2$ estimator. A
linear hypothesis can be tested using general methods discussed in Section 
4.5.1. The problem of choosing a model among several alternatives can be 
solved either by the Akaike Information Criterion (Section 4.5.2) or by Cox’s 
test of nonnested hypotheses (Section 4.5.3). For other criteria for choosing 
models, see the article by Amemiya (1981). 

Here we shall discuss only a chi-square test based on Berkson's MIN $\chi^2$
estimator as this is not a special case of the tests discussed in Section 4.5. The
test statistic is the weighted sum of squared residuals (WSSR) from Eq.
(9.2.30) defined by

$$\mathrm{WSSR} = \sum_{t=1}^T \hat\sigma_t^{-2}[F^{-1}(\hat P_t) - x_t'\tilde\beta]^2. \tag{9.2.42}$$

In the normal heteroscedastic regression model $y \sim N(X\beta_0, D)$ with known $D$,
$(y - X\hat\beta_G)'D^{-1}(y - X\hat\beta_G)$, where $\hat\beta_G$ is the GLS estimator, is distributed as
$\chi^2_{T-K}$. From this fact we can deduce that WSSR defined in (9.2.42) is
asymptotically distributed as $\chi^2_{T-K}$.

We can use this fact to choose between the unconstrained model $P_t =
F(x_t'\beta_0)$ and the constrained model $P_t = F(x_{1t}'\beta_{10})$, where $x_{1t}$ and $\beta_{10}$ are the
first $K - q$ elements of $x_t$ and $\beta_0$, respectively. Let $\mathrm{WSSR}_u$ and $\mathrm{WSSR}_c$ be the
values of (9.2.42) derived from the unconstrained and the constrained
models, respectively. Then we should choose the unconstrained model if and
only if

$$\mathrm{WSSR}_c - \mathrm{WSSR}_u > \chi^2_{q,\alpha}, \tag{9.2.43}$$

where $\chi^2_{q,\alpha}$ denotes the $\alpha$% critical value of $\chi^2_q$. Li (1977) has given examples of
the use of this test with real data.
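
The decision rule (9.2.43) can be sketched as follows for the logit model. This is an illustration under stated assumptions: NumPy is assumed available, the function names are hypothetical, the constrained model is formed by dropping the last $q$ columns of the cell regressor matrix, and only the 5% critical values for $q \le 3$ are tabulated in the code.

```python
import numpy as np

# Upper 5% points of the chi-square distribution for 1, 2, and 3 degrees of freedom.
CHI2_CRITICAL_5PCT = {1: 3.841, 2: 5.991, 3: 7.815}

def logit_wssr(X_cells, r, n):
    """WSSR (9.2.42) for the logit model, evaluated at the MIN chi-square estimator."""
    P_hat = r / n
    z = np.log(P_hat / (1.0 - P_hat))            # logit transformation of P_hat_t
    w = n * P_hat * (1.0 - P_hat)                 # sigma_hat_t^{-2} for the logit model
    beta = np.linalg.solve(X_cells.T @ (X_cells * w[:, None]), X_cells.T @ (w * z))
    resid = z - X_cells @ beta
    return float(np.sum(w * resid**2))

def prefer_unconstrained(X_cells, r, n, q):
    """Apply rule (9.2.43); the constrained model keeps the first K - q columns."""
    wssr_u = logit_wssr(X_cells, r, n)
    wssr_c = logit_wssr(X_cells[:, :-q], r, n)
    difference = wssr_c - wssr_u
    return difference > CHI2_CRITICAL_5PCT[q], difference

if __name__ == "__main__":
    # Hypothetical grouped data: does the dosage coefficient belong in the model?
    X_cells = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    r = np.array([6.0, 13.0, 25.0, 33.0])
    n = np.array([40.0, 40.0, 40.0, 40.0])
    print(prefer_unconstrained(X_cells, r, n, q=1))
```

Because the weights $\hat\sigma_t^{-2}$ depend only on $\hat P_t$, the same weights enter both fits, so the difference is nonnegative up to rounding error.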
To choose among nonnested models, we can use the following variation of
the Akaike Information Criterion:

$$\mathrm{AIC} = \tfrac{1}{2}\,\mathrm{WSSR} + K. \tag{9.2.44}$$

This may be justified on the grounds that in the normal heteroscedastic
regression model mentioned earlier, $(y - X\hat\beta_G)'D^{-1}(y - X\hat\beta_G)$, evaluated at
the GLS estimator $\hat\beta_G$, is equal to $-2\log L$ aside from a constant term.

Instead of (9.2.42) we can also use

$$\sum_{t=1}^T n_t[\hat P_t(1 - \hat P_t)]^{-1}[\hat P_t - F(x_t'\tilde\beta)]^2, \tag{9.2.45}$$

because this is asymptotically equivalent to (9.2.42).


9.2.8 Discriminant Analysis 


The purpose of discriminant analysis is to measure the characteristics of an 
individual or an object and, on the basis of the measurements, to classify the 
individual or the object into one of two possible groups. For example, we may
wish to accept or reject a college applicant on the basis of examination scores,
or determine whether a particular skull belongs to a man or an anthropoid on
the basis of its measurements.

We can state the problem statistically as follows: Supposing that the vector
of random variables $x^*$ is generated according to either a density $g_1$ or $g_0$, we
are to classify a given observation on $x^*$, denoted $x_i^*$, into the group
characterized by either $g_1$ or $g_0$. It is useful to define $y_i = 1$ if $x_i^*$ is generated by $g_1$ and
$y_i = 0$ if it is generated by $g_0$. We are to predict $y_i$ on the basis of $x_i^*$. The
essential information needed for the prediction is the conditional probability
$P(y_i = 1 \mid x_i^*)$. We shall ignore the problem of prediction given the conditional
probability and address the question of how to specify and estimate the
conditional probability.⁴

By Bayes's rule we have

$$P(y_i = 1 \mid x_i^*) = \frac{g_1(x_i^*)\, q_1}{g_1(x_i^*)\, q_1 + g_0(x_i^*)\, q_0}, \tag{9.2.46}$$

where $q_1$ and $q_0$ denote the marginal probabilities $P(y_i = 1)$ and $P(y_i = 0)$,
respectively. We shall evaluate (9.2.46), assuming that $g_1$ and $g_0$ are the
densities of $N(\mu_1, \Sigma_1)$ and $N(\mu_0, \Sigma_0)$, respectively. We state this assumption
formally as

$$\begin{aligned}
x_i^* \mid (y_i = 1) &\sim N(\mu_1, \Sigma_1) \\
x_i^* \mid (y_i = 0) &\sim N(\mu_0, \Sigma_0).
\end{aligned} \tag{9.2.47}$$

This is the most commonly used form of discriminant analysis, sometimes
referred to as normal discriminant analysis.
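Under (9.2.47), (9.2.46) reduces to the following quadratic logit model:

$$P(y_i = 1 \mid x_i^*) = \Lambda(\beta_{(1)} + \beta_{(2)}' x_i^* + x_i^{*\prime} A x_i^*), \tag{9.2.48}$$

where

$$\beta_{(1)} = \tfrac{1}{2}\mu_0'\Sigma_0^{-1}\mu_0 - \tfrac{1}{2}\mu_1'\Sigma_1^{-1}\mu_1 + \log q_1 - \log q_0 - \tfrac{1}{2}\log|\Sigma_1| + \tfrac{1}{2}\log|\Sigma_0|, \tag{9.2.49}$$

$$\beta_{(2)} = \Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0, \tag{9.2.50}$$

and

$$A = \tfrac{1}{2}(\Sigma_0^{-1} - \Sigma_1^{-1}). \tag{9.2.51}$$

In the special case $\Sigma_1 = \Sigma_0$, which is often assumed in econometric
applications, we have $A = 0$; therefore (9.2.48) further reduces to a linear logit model:

$$P(y_i = 1 \mid x_i^*) = \Lambda(x_i'\beta), \tag{9.2.52}$$

where we have written $\beta_{(1)} + \beta_{(2)}' x_i^* = x_i'\beta$ to conform with the notation of
Section 9.2.1.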

Let us consider the ML estimation of the parameters $\mu_1$, $\mu_0$, $\Sigma_1$, $\Sigma_0$, $q_1$, and
$q_0$ based on observations $(y_i, x_i^*)$, $i = 1, 2, \ldots, n$. The determination of $q_1$
and $q_0$ varies with authors. We shall adopt the approach of Warner (1963) and
treat $q_1$ and $q_0$ as unknown parameters to estimate. The likelihood function
can be written as

$$L = \prod_{i=1}^n [g_1(x_i^*)\, q_1]^{y_i}[g_0(x_i^*)\, q_0]^{1 - y_i}. \tag{9.2.53}$$

Equating the derivatives of $\log L$ to 0 yields the following ML estimators:

$$\hat q_1 = \frac{n_1}{n}, \tag{9.2.54}$$

where $n_1 = \sum_{i=1}^n y_i$,

$$\hat q_0 = \frac{n_0}{n}, \tag{9.2.55}$$

where $n_0 = n - n_1$,

$$\hat\mu_1 = \frac{1}{n_1}\sum_{i=1}^n y_i x_i^*, \tag{9.2.56}$$

$$\hat\mu_0 = \frac{1}{n_0}\sum_{i=1}^n (1 - y_i) x_i^*, \tag{9.2.57}$$

$$\hat\Sigma_1 = \frac{1}{n_1}\sum_{i=1}^n y_i (x_i^* - \hat\mu_1)(x_i^* - \hat\mu_1)', \tag{9.2.58}$$

and

$$\hat\Sigma_0 = \frac{1}{n_0}\sum_{i=1}^n (1 - y_i)(x_i^* - \hat\mu_0)(x_i^* - \hat\mu_0)'. \tag{9.2.59}$$

If $\Sigma_1 = \Sigma_0\ (\equiv \Sigma)$ as is often assumed, (9.2.58) and (9.2.59) should be
replaced by

$$\hat\Sigma = \frac{1}{n}\left[\sum_{i=1}^n y_i (x_i^* - \hat\mu_1)(x_i^* - \hat\mu_1)' + \sum_{i=1}^n (1 - y_i)(x_i^* - \hat\mu_0)(x_i^* - \hat\mu_0)'\right]. \tag{9.2.60}$$

The ML estimators of $\beta_{(1)}$, $\beta_{(2)}$, and $A$ are obtained by inserting these estimates
into the right-hand side of (9.2.49), (9.2.50), and (9.2.51).
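
For the equal-covariance case, the estimators (9.2.54) through (9.2.57) and (9.2.60) and the resulting coefficients of the linear logit model can be computed in a few lines. The sketch below assumes NumPy; the function name and the simulated data are hypothetical, and with $\Sigma_1 = \Sigma_0$ the log-determinant terms in (9.2.49) cancel.

```python
import numpy as np

def da_estimator(X_star, y):
    """Estimates of (beta_(1), beta_(2)) under Sigma_1 = Sigma_0, from (9.2.54)-(9.2.60)."""
    X_star = np.asarray(X_star, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    n1 = y.sum()
    n0 = n - n1
    q1, q0 = n1 / n, n0 / n                                   # (9.2.54), (9.2.55)
    mu1 = (y[:, None] * X_star).sum(axis=0) / n1              # (9.2.56)
    mu0 = ((1.0 - y)[:, None] * X_star).sum(axis=0) / n0      # (9.2.57)
    d1 = (X_star - mu1) * y[:, None]                          # deviations, group y = 1
    d0 = (X_star - mu0) * (1.0 - y)[:, None]                  # deviations, group y = 0
    sigma = (d1.T @ d1 + d0.T @ d0) / n                       # pooled estimate (9.2.60)
    sigma_inv = np.linalg.inv(sigma)
    beta2 = sigma_inv @ (mu1 - mu0)                           # (9.2.50) with common Sigma
    beta1 = 0.5 * (mu0 @ sigma_inv @ mu0 - mu1 @ sigma_inv @ mu1) + np.log(q1 / q0)
    return beta1, beta2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 5000
    mu1, mu0 = np.array([1.0, 0.0]), np.array([0.0, 0.5])    # hypothetical group means
    Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
    y = (rng.random(n) < 0.4).astype(float)
    X_star = rng.multivariate_normal(np.zeros(2), Sigma, size=n) + np.where(y[:, None] == 1, mu1, mu0)
    print(da_estimator(X_star, y))
```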

Discriminant analysis is frequently used in transport modal choice analysis. 
See, for example, articles by Warner (1962) and McGillivray (1972). 

We call the model defined by (9.2.47) with $\Sigma_1 = \Sigma_0$ and by (9.2.52) the
discriminant analysis (DA) model and call the estimator of $\beta \equiv (\beta_{(1)}, \beta_{(2)}')'$
obtained by inserting (9.2.56), (9.2.57), and (9.2.60) into (9.2.49) and (9.2.50)
with $\Sigma_1 = \Sigma_0$ the DA estimator, denoted $\hat\beta_{DA}$. In contrast, if we assume only
(9.2.52) and not (9.2.47), we have a logit model. We denote the logit MLE of $\beta$
by $\hat\beta_\Lambda$. In the remainder of this section we shall compare these two estimators.

The relative performance of the two estimators will critically depend on the
assumed true distribution for $x_i^*$. If (9.2.47) with $\Sigma_1 = \Sigma_0$ is assumed in
addition to (9.2.52), the DA estimator is the genuine MLE and therefore should be
asymptotically more efficient than the logit MLE. However, if (9.2.47) is not
assumed, the DA estimator loses its consistency in general, whereas the logit
MLE retains its consistency. Thus we would expect the logit MLE to be more
robust.

Efron (1975) assumed the DA model to be the correct model and studied the
loss of efficiency that results if $\beta$ is estimated by the logit MLE. He used the
asymptotic mean of the error rate as a measure of the inefficiency of an
estimator. Conditional on a given estimator $\hat\beta$ (be it $\hat\beta_{DA}$ or $\hat\beta_\Lambda$), the error rate is
defined by

$$\begin{aligned}
\text{Error Rate} &= P[x'\hat\beta \ge 0 \mid x \sim N(\mu_0, \Sigma)]\, q_0 + P[x'\hat\beta < 0 \mid x \sim N(\mu_1, \Sigma)]\, q_1 \\
&= q_0\,\Phi[(\hat\beta'\Sigma\hat\beta)^{-1/2}\mu_0'\hat\beta] + q_1\,\Phi[-(\hat\beta'\Sigma\hat\beta)^{-1/2}\mu_1'\hat\beta].
\end{aligned} \tag{9.2.61}$$

Efron derived the asymptotic mean of (9.2.61) for each of the cases $\hat\beta = \hat\beta_{DA}$
and $\hat\beta = \hat\beta_\Lambda$, using the asymptotic distributions of the two estimators. Defining
the relative efficiency of the logit ML estimator as the ratio of the asymptotic
mean of the error rate of the DA estimator to that of the logit ML estimator,
Efron found that the efficiency ranges between 40 and 90% for the various
experimental parameter values he chose.

Press and Wilson (1978) compared the classification derived from the two 
estimators in two real data examples in which many of the independent 
variables are binary and therefore clearly violate the DA assumption (9.2.47). 
Their results indicated a surprisingly good performance by DA (only slightly 
worse than the logit MLE) in terms of the percentage of correct classification 
both for the sample observations and for the validation set. 

Amemiya and Powell (1983), motivated by the studies of Efron, Press, and 
Wilson, considered a simple model with characteristics similar to the two 
examples of Press and Wilson and analyzed it using the asymptotic techniques 
analogous to those of Efron. They compared the two estimators in a logit 
model with two binary independent variables. The criteria they used were the 
asymptotic mean of the probability of correct classification (PCC) (that is, one
minus the error rate) and the asymptotic mean squared error. They found that 
in terms of the PCC criterion, the DA estimator does very well — only slightly 
worse than the logit MLE, thus confirming the results of Press and Wilson. For 
all the experimental parameter values they considered, the lowest efficiency of 
the DA estimator in terms of the PCC criterion was 97%. The DA estimator
performed quite well in terms of the mean squared error criterion as well, 
although it did not do as well as it did in terms of the PCC criterion and it did 
poorly for some parameter values. Although the DA estimator is inconsistent 
in the model they considered, the degree of inconsistency (the difference 
between the probability limit and the true value) was surprisingly small in a 
majority of the cases. Thus normal discriminant analysis seems more robust 
against nonnormality than we would intuitively expect.
We should point out, however, that their study was confined to the case of
binary independent variables; the DA estimator may not be robust against a 
different type of nonnormality. McFadden (1976a) illustrated a rather signifi- 
cant asymptotic bias of a DA estimator in a model in which the marginal 
distribution of the independent variable is normal. [Note that when we spoke 
of normality we referred to each of the two conditional distributions given in 
(9.2.47). The marginal distribution of x* is not normal in the DA model but, 
rather, is a mixture of normals.] Lachenbruch, Sneeringer, and Revo (1973) 
also reported a poor performance of the DA estimator in certain nonnormal 
models. 


9.2.9 Aggregate Prediction 


We shall consider the problem of predicting the aggregate proportion
$r \equiv n^{-1}\sum_{i=1}^n y_i$ in the QR model (9.2.1). This is often an important practical problem
for a policymaker. For example, in the transport choice model of Example
9.2.2, a policymaker would like to know the proportion of people in a community
who use the transit when a new fare and the other values of the independent
variables $x$ prevail. It is assumed that $\beta$ (suppressing the subscript 0) has
been estimated from the past sample. Moreover, to simplify the analysis, we
shall assume for the time being that the estimated value of $\beta$ is equal to the true
value.

The prediction of $r$ should be done on the basis of the conditional distribution
of $r$ given $\{x_i\}$. When $n$ is large, the following asymptotic distribution is
accurate:

$$
r \overset{A}{\sim} N\left[ n^{-1}\sum_{i=1}^n F_i,\; n^{-2}\sum_{i=1}^n F_i(1 - F_i) \right]. \tag{9.2.62}
$$


Once we calculate the asymptotic mean and variance, we have all the neces- 
sary information for predicting r. 

If we actually observe every $x_i$, $i = 1, 2, \ldots, n$, we can calculate the mean
and variance for (9.2.62) straightforwardly. However, because it is more realistic
to assume that we cannot observe every $x_i$, we shall consider the problem
of how to estimate the mean and variance for (9.2.62) in that situation. For
that purpose we assume that $\{x_i\}$ are i.i.d. random variables with the common
$K$-variate distribution function $G$. Then the asymptotic mean and variance of
$r$ can be estimated by $EF(x'\beta)$ and $n^{-1}EF(1 - F)$, respectively, where $E$ is the
expectation taken with respect to $G$.
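
For the case in which every $x_i$ is observed, the mean and variance in (9.2.62) can be computed directly. The sketch below assumes a logit specification for $F$ and treats $\beta$ as known; the function names are hypothetical.

```python
import math

def logistic(z):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-z))

def aggregate_moments(xs, beta, F=logistic):
    """Mean and variance of r in (9.2.62) when every x_i is observed."""
    n = len(xs)
    probs = [F(sum(a * b for a, b in zip(x, beta))) for x in xs]
    mean = sum(probs) / n                               # n^{-1} sum F_i
    var = sum(p * (1.0 - p) for p in probs) / n ** 2    # n^{-2} sum F_i(1 - F_i)
    return mean, var
```

When the $x_i$ are not all observed, the same function applied to a random sample drawn from an estimate of $G$ would give a simulation estimate of $EF$ and $n^{-1}EF(1 - F)$, provided the returned variance is rescaled to the population size of interest.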

Westin (1974) studied the evaluation of $EF$ when $F = \Lambda$ (logistic distribution)
and $x \sim N(\mu_x, \Sigma_x)$. He noted that the density of $p \equiv \Lambda(x'\beta)$ is given by

$$
f(p) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\frac{1}{p(1 - p)} \exp\left\{ -\frac{1}{2\sigma^2}\left[ \log\left(\frac{p}{1 - p}\right) - \mu \right]^2 \right\}, \tag{9.2.63}
$$

where $\mu = \mu_x'\beta$ and $\sigma^2 = \beta'\Sigma_x\beta$. Because the mean of this density does not
have a closed form expression, we must evaluate $\int p f(p)\,dp$ numerically for
given values of $\mu$ and $\sigma^2$.

McFadden and Reid (1975) showed that if $F = \Phi$ (standard normal distribution)
and $x \sim N(\mu_x, \Sigma_x)$ as before, we have

$$
E\Phi(x'\beta) = \Phi[(1 + \sigma^2)^{-1/2}\mu]. \tag{9.2.64}
$$


Thus the evaluation of EF is much simpler than in the logit case. 
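
The contrast between the two cases can be illustrated numerically. The sketch below evaluates $EF(\mu + \sigma Z)$, $Z \sim N(0, 1)$, by the trapezoidal rule (our choice of quadrature, not Westin's) and compares the probit result with the closed form (9.2.64). Since $x'\beta \sim N(\mu, \sigma^2)$, integrating over $Z$ is equivalent to integrating $p f(p)$ in (9.2.63). All function names and grid settings are assumptions.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def expect_over_normal(F, mu, sigma, n_grid=4001, width=8.0):
    """Approximate E F(mu + sigma*Z), Z ~ N(0,1), by the trapezoidal rule."""
    h = 2.0 * width / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        z = -width + i * h
        w = 0.5 if i in (0, n_grid - 1) else 1.0  # end points get half weight
        total += w * F(mu + sigma * z) * norm_pdf(z)
    return total * h

def probit_closed_form(mu, sigma):
    """Right-hand side of (9.2.64)."""
    return norm_cdf(mu / math.sqrt(1.0 + sigma ** 2))

numeric_probit = expect_over_normal(norm_cdf, 0.5, 1.2)
numeric_logit = expect_over_normal(logistic, 0.0, 1.2)
```

For the logit case no closed form is available, so `numeric_logit` has to be obtained by quadrature of this kind; with $\mu = 0$ symmetry gives the value one-half, which serves as a check.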

Neither Westin nor McFadden and Reid considered the evaluation of the 
asymptotic variance of r, which constitutes an important piece of information 
for the purpose of prediction. 

Another deficiency of these studies is that the variability due to the estimation
of $\beta$ is totally ignored. We shall suggest a partially Bayesian way to deal
with this problem.5 Given an estimate $\hat\beta$ of $\beta$, we treat $\beta$ as a random variable
with the distribution $N(\hat\beta, \Sigma_\beta)$. An estimate of the asymptotic covariance
matrix of the estimator $\hat\beta$ can be used for $\Sigma_\beta$. We now regard (9.2.62) as the
asymptotic distribution of $r$ conditionally on $\{x_i\}$ and $\beta$. The distribution of $r$
conditionally on $\{x_i\}$ but not on $\beta$ is therefore given by

$$
r \overset{A}{\sim} N\left[ E_\beta n^{-1}\sum_{i=1}^n F_i,\; E_\beta n^{-2}\sum_{i=1}^n F_i(1 - F_i) + V_\beta n^{-1}\sum_{i=1}^n F_i \right]. \tag{9.2.65}
$$


Finally, if the total observations on $\{x_i\}$ are not available and $\{x_i\}$ can be
regarded as i.i.d. random variables, we can approximate (9.2.65) by

$$
r \overset{A}{\sim} N[E_x E_\beta F,\; n^{-1}E_x E_\beta F(1 - F) + E_x V_\beta F]. \tag{9.2.66}
$$


9.3 Multinomial Models 
9.3.1 Statistical Inference 


In this section we shall define a general multinomial QR model and shall 
discuss maximum likelihood and minimum chi-square estimation of this 
model, along with the associated test statistics. In the subsequent sections we
shall discuss various types of the multinomial QR model and the specific
problems that arise with them. 

Assuming that the dependent variable $y_i$ takes $m_i + 1$ values $0, 1, 2, \ldots, m_i$,
we define a general multinomial QR model as

$$
P(y_i = j) = F_{ij}(x^*, \theta), \qquad i = 1, 2, \ldots, n \text{ and } j = 1, 2, \ldots, m_i, \tag{9.3.1}
$$


where $x^*$ and $\theta$ are vectors of independent variables and parameters, respectively.
(Strictly speaking, we should write $j$ as $j_i$, but we shall suppress the
subscript $i$.) We sometimes write (9.3.1) simply as $P_{ij} = F_{ij}$. We shall allow the
possibility that not all the independent variables and parameters are included
in the argument of every $F_{ij}$. Note that $P(y_i = 0)\ (\equiv F_{i0})$ need not be specified
because it must be equal to one minus the sum of the $m_i$ probabilities defined
in (9.3.1).

It is important to let $m_i$ depend on $i$ because in many applications individuals
face different choice sets. For example, in transport modal choice analysis,
traveling by train is not included in the choice set of those who live outside
of its service area. 

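To define the maximum likelihood estimator of $\theta$ in the model (9.3.1) it is
useful to define $\sum_{i=1}^n (m_i + 1)$ binary variables

$$
y_{ij} = \begin{cases} 1 & \text{if } y_i = j \\ 0 & \text{if } y_i \neq j, \end{cases} \qquad i = 1, 2, \ldots, n \text{ and } j = 0, 1, \ldots, m_i. \tag{9.3.2}
$$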


Then we can write the log likelihood function as

$$
\log L = \sum_{i=1}^n \sum_{j=0}^{m_i} y_{ij} \log F_{ij}, \tag{9.3.3}
$$

which is a natural generalization of (9.2.7). The MLE $\hat\theta$ of $\theta$ is defined as a
solution of the normal equation $\partial \log L/\partial\theta = 0$.
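
The log likelihood (9.3.3) is simple to evaluate once the probabilities $F_{ij}$ are available. The following sketch takes the probabilities as given (it does not estimate $\theta$) and allows the choice sets to differ in size across $i$; the function name and data layout are assumptions.

```python
import math

def multinomial_loglik(y, probs):
    """log L = sum_i sum_j y_ij log F_ij, with y_ij = 1 iff y_i = j.

    y[i] is the observed category of individual i, and probs[i] lists
    F_i0, ..., F_im_i, so the list lengths may differ across i.
    """
    total = 0.0
    for yi, Fi in zip(y, probs):
        if abs(sum(Fi) - 1.0) > 1e-9:
            raise ValueError("probabilities must sum to one for each i")
        total += math.log(Fi[yi])  # only the chosen j has y_ij = 1
    return total

# Assumed example: the first individual faces two alternatives,
# the second faces three.
value = multinomial_loglik([0, 2], [[0.5, 0.5], [0.2, 0.3, 0.5]])
```

An estimate of $\theta$ would be obtained by maximizing this function over the parameters that generate `probs`.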

Many of the results about the MLE in the binary case hold for the model 
(9.3.1) as well. A reasonable set of sufficient conditions for its consistency and 
asymptotic normality can be found using the relevant theorems of Chapter 4, 
as we have done for the binary case. The equivalence of the method of scoring 
and the NLWLS (NLGLS to be exact) iteration can also be shown. However, 
we shall demonstrate these things under a slightly less general model than 
(9.3.1). We assume

$$
P(y_i = j) = F_j(x_{i1}'\beta_1, x_{i2}'\beta_2, \ldots, x_{iH}'\beta_H), \qquad i = 1, 2, \ldots, n \text{ and } j = 1, 2, \ldots, m, \tag{9.3.4}
$$

where $H$ is a fixed integer. In most specific examples of the multinomial QR
model we have $H = m$, but this need not be so for this analysis. Note that we
have now assumed that $m$ does not depend on $i$ to simplify the analysis.

We restate Assumptions 9.2.1, 9.2.2, and 9.2.3 with obvious modifications 
as follows. 


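ASSUMPTION 9.3.1. $F_{ij}$ has partial derivatives $f_{ij}^k \equiv \partial F_{ij}/\partial(x_{ik}'\beta_k)$ and partial
second-order derivatives $f_{ij}^{kl} \equiv \partial f_{ij}^k/\partial(x_{il}'\beta_l)$ for every $i$, $j$, $k$, $l$, and $0 < F_{ij} < 1$
and $f_{ij}^k > 0$ for every $i$, $j$, and $k$.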


ASSUMPTION 9.3.2. The parameter space B is an open bounded subset of a 
Euclidean space. 


ASSUMPTION 9.3.3. $\{x_{ih}\}$ are uniformly bounded in $i$ for every $h$ and
$\lim_{n\to\infty} n^{-1}\sum_{i=1}^n x_{ih}x_{ih}'$ is a finite nonsingular matrix for every $h$. Furthermore,
the empirical distribution function of $\{x_{ih}\}$ converges to a distribution function.


Under these assumptions the MLE $\hat\beta$ of $\beta = (\beta_1', \beta_2', \ldots, \beta_H')'$ can be
shown to be consistent and asymptotically normal. We shall derive its asymptotic
covariance matrix.

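Differentiating (9.3.3) with respect to $\beta_k$, we obtain

$$
\frac{\partial \log L}{\partial \beta_k} = \sum_{i=1}^n \sum_{j=0}^m y_{ij} F_{ij}^{-1} f_{ij}^k x_{ik}. \tag{9.3.5}
$$

Differentiating (9.3.5) with respect to $\beta_l'$ yields

$$
\frac{\partial^2 \log L}{\partial \beta_k \partial \beta_l'} = -\sum_{i=1}^n \sum_{j=0}^m y_{ij} F_{ij}^{-2} f_{ij}^k f_{ij}^l x_{ik} x_{il}' + \sum_{i=1}^n \sum_{j=0}^m y_{ij} F_{ij}^{-1} f_{ij}^{kl} x_{ik} x_{il}'. \tag{9.3.6}
$$

Taking the expectation and noting $\sum_{j=0}^m f_{ij}^{kl} = 0$, we obtain

$$
-E\frac{\partial^2 \log L}{\partial \beta_k \partial \beta_l'} = \sum_{i=1}^n \sum_{j=0}^m F_{ij}^{-1} f_{ij}^k f_{ij}^l x_{ik} x_{il}' \equiv A_{kl}. \tag{9.3.7}
$$

Define $A = \{A_{kl}\}$. Then we can show under Assumptions 9.3.1, 9.3.2, and
9.3.3

$$
\sqrt{n}(\hat\beta - \beta) \to N(0, \lim_{n\to\infty} nA^{-1}). \tag{9.3.8}
$$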


To show the equivalence between the method-of-scoring and the NLWLS 
iteration (see Amemiya, 1976b), it is convenient to rewrite (9.3.5) and (9.3.7) 
using the following vector notation: 


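$$
\begin{aligned}
y_i &= (y_{i1}, y_{i2}, \ldots, y_{im})' \\
F_i &= (F_{i1}, F_{i2}, \ldots, F_{im})' \\
f_i^k &= (f_{i1}^k, f_{i2}^k, \ldots, f_{im}^k)' \\
\Lambda_i &= E(y_i - F_i)(y_i - F_i)' = D(F_i) - F_iF_i',
\end{aligned}
\tag{9.3.9}
$$

where $D(F_i)$ is the diagonal matrix the $j$th diagonal element of which is $F_{ij}$.
Then, using the identities $\sum_{j=0}^m F_{ij} = 1$ and $\sum_{j=0}^m f_{ij}^k = 0$ and noting

$$
\Lambda_i^{-1} = D^{-1}(F_i) + F_{i0}^{-1}ll', \tag{9.3.10}
$$

where $l$ is an $m$-vector of ones, we obtain from (9.3.5) and (9.3.7)

$$
\frac{\partial \log L}{\partial \beta_k} = \sum_{i=1}^n x_{ik} f_i^{k\prime} \Lambda_i^{-1}(y_i - F_i) \tag{9.3.11}
$$

and

$$
-E\frac{\partial^2 \log L}{\partial \beta_k \partial \beta_l'} = \sum_{i=1}^n x_{ik} f_i^{k\prime} \Lambda_i^{-1} f_i^l x_{il}'. \tag{9.3.12}
$$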
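
The inverse formula (9.3.10) can be checked numerically by multiplying $D(F_i) - F_iF_i'$ by $D^{-1}(F_i) + F_{i0}^{-1}ll'$ and comparing the product with the identity matrix. The sketch below does this for one assumed probability vector; the function name is hypothetical.

```python
def check_inverse_identity(F):
    """F = (F_1, ..., F_m); F_0 = 1 - sum(F).

    Returns the largest absolute deviation of
    [D(F) - F F'] [D(F)^{-1} + F_0^{-1} l l'] from the identity matrix.
    """
    m = len(F)
    F0 = 1.0 - sum(F)
    lam = [[(F[a] if a == b else 0.0) - F[a] * F[b] for b in range(m)]
           for a in range(m)]
    inv = [[(1.0 / F[a] if a == b else 0.0) + 1.0 / F0 for b in range(m)]
           for a in range(m)]
    worst = 0.0
    for a in range(m):
        for b in range(m):
            prod = sum(lam[a][c] * inv[c][b] for c in range(m))
            worst = max(worst, abs(prod - (1.0 if a == b else 0.0)))
    return worst

deviation = check_inverse_identity([0.2, 0.3, 0.1])  # here F_0 = 0.4
```

The deviation should be zero up to rounding error for any probabilities strictly between 0 and 1 that leave $F_{i0} > 0$.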


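Suppose $\bar\beta$ is the initial estimate of $\beta$, let $\bar F_i$, $\bar f_i^k$, and $\bar\Lambda_i$ be, respectively, $F_i$, $f_i^k$,
and $\Lambda_i$ evaluated at $\bar\beta$ and define

$$
\bar A_{kl} = \sum_{i=1}^n x_{ik} \bar f_i^{k\prime} \bar\Lambda_i^{-1} \bar f_i^l x_{il}' \tag{9.3.13}
$$

and

$$
\bar c_k = \sum_{i=1}^n x_{ik} \bar f_i^{k\prime} \bar\Lambda_i^{-1} \sum_{h=1}^H \bar f_i^h x_{ih}'\bar\beta_h + \sum_{i=1}^n x_{ik} \bar f_i^{k\prime} \bar\Lambda_i^{-1}(y_i - \bar F_i). \tag{9.3.14}
$$

Then the second-round estimator $\bar{\bar\beta}$ obtained by the method-of-scoring iteration
started from $\bar\beta$ is given by

$$
\bar{\bar\beta} = \bar A^{-1}\bar c, \tag{9.3.15}
$$

where $\bar A = \{\bar A_{kl}\}$ and $\bar c = (\bar c_1', \bar c_2', \ldots, \bar c_H')'$.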
We can interpret (9.3.15) as an NLGLS iteration. From (9.3.4) we obtain

$$
y_i = F_i + u_i, \qquad i = 1, 2, \ldots, n, \tag{9.3.16}
$$

where $Eu_i = 0$, $Eu_iu_i' = \Lambda_i$, and $Eu_iu_j' = 0$ for $i \neq j$. Expanding $F_i$ in Taylor
series around $x_{ih}'\bar\beta_h$, $h = 1, 2, \ldots, H$, we obtain

$$
y_i - \bar F_i + \sum_{h=1}^H \bar f_i^h x_{ih}'\bar\beta_h \cong \sum_{h=1}^H \bar f_i^h x_{ih}'\beta_h + u_i, \qquad i = 1, 2, \ldots, n. \tag{9.3.17}
$$

The approximation (9.3.17) can be regarded as a linear regression model of
$mn$ observations with a nonscalar covariance matrix. If we estimate $\Lambda_i$ by $\bar\Lambda_i$
and apply FGLS to (9.3.17), the resulting estimator is precisely (9.3.15).

Next we shall derive the MIN $\chi^2$ estimator of $\beta$ in model (9.3.4), following
Amemiya (1976b). As in Section 9.2.5, we assume that there are many observations
with the same value of independent variables. Assuming that the
vector of independent variables takes $T$ distinct values, we can write (9.3.4) as

$$
P_{tj} = F_j(x_{t1}'\beta_1, x_{t2}'\beta_2, \ldots, x_{tH}'\beta_H), \qquad t = 1, 2, \ldots, T \text{ and } j = 1, 2, \ldots, m. \tag{9.3.18}
$$

To define the MIN $\chi^2$ estimator of $\beta$, we must be able to invert $m$ equations in
(9.3.18) for every $t$ and solve them for $H$ variables $x_{t1}'\beta_1, x_{t2}'\beta_2, \ldots, x_{tH}'\beta_H$.
Thus for the moment, assume $H = m$. Assuming, furthermore, that the Jacobian
does not vanish, we obtain

$$
x_{tk}'\beta_k = G_k(P_{t1}, P_{t2}, \ldots, P_{tm}), \qquad t = 1, 2, \ldots, T \text{ and } k = 1, 2, \ldots, m. \tag{9.3.19}
$$


As in Section 9.2.5 we define $r_{tj} = \sum_{i \in I_t} y_{ij}$, where $I_t$ is the set of $i$ for which
$x_{ih} = x_{th}$ for all $h$, and $\hat P_{tj} = r_{tj}/n_t$, where $n_t$ is the number of integers in $I_t$.
Expanding $G_k(\hat P_{t1}, \hat P_{t2}, \ldots, \hat P_{tm})$ in Taylor series around $(P_{t1}, P_{t2}, \ldots, P_{tm})$
and using (9.3.19), we obtain

$$
G_k(\hat P_{t1}, \hat P_{t2}, \ldots, \hat P_{tm}) \cong x_{tk}'\beta_k + \sum_{j=1}^m g_{tk}^j(\hat P_{tj} - F_{tj}), \qquad t = 1, 2, \ldots, T \text{ and } k = 1, 2, \ldots, m, \tag{9.3.20}
$$

where $g_{tk}^j = \partial G_k/\partial P_{tj}$. Equation (9.3.20) is a generalization of (9.2.30), but
here we write it as an approximate equation, ignoring an error term that
corresponds to $w_t$ in (9.2.30). Equation (9.3.20) is an approximate linear
regression equation with a nonscalar covariance matrix that depends
on $\Lambda_t \equiv E(\hat P_t - F_t)(\hat P_t - F_t)' = n_t^{-1}[D(F_t) - F_tF_t']$, where $\hat P_t = (\hat P_{t1}, \hat P_{t2}, \ldots, \hat P_{tm})'$
and $F_t = (F_{t1}, F_{t2}, \ldots, F_{tm})'$. The MIN $\chi^2$ estimator $\tilde\beta$ of $\beta$ is
defined as FGLS applied to (9.3.20), using $\hat\Lambda_t = n_t^{-1}[D(\hat P_t) - \hat P_t\hat P_t']$ as an
estimator of $\Lambda_t$. An example of the MIN $\chi^2$ estimator will be given in Example
9.3.1 in Section 9.3.2.

The consistency and the asymptotic normality of $\tilde\beta$ can be proved by using a
method similar to the one employed in Section 9.2.5. Let $\Omega^{-1}$ be the asymptotic
covariance matrix of $\tilde\beta$, that is, $\sqrt{n}(\tilde\beta - \beta) \to N(0, \lim_{n\to\infty} n\Omega^{-1})$. Then it
can be deduced from (9.3.20) that the $k,l$th subblock of $\Omega$ is given by

$$
\Omega_{kl} = \sum_{t=1}^T x_{tk}(G_t'\Lambda_tG_t)^{-1}_{kl}\,x_{tl}', \tag{9.3.21}
$$

where $G_t'$ is an $m \times m$ matrix the $k$th row of which is equal to $(g_{tk}^1, g_{tk}^2, \ldots, g_{tk}^m)$
and $(\ )^{-1}_{kl}$ denotes the $k,l$th element of the inverse of the matrix inside
$(\ )$. Now, we obtain from (9.3.12)

$$
A_{kl} = \sum_{t=1}^T x_{tk}(F_t'\Lambda_t^{-1}F_t)_{kl}\,x_{tl}', \tag{9.3.22}
$$

where $F_t'$ is an $m \times m$ matrix the $k$th row of which is equal to $(f_{t1}^k, f_{t2}^k, \ldots, f_{tm}^k)$.
Thus the asymptotic equivalence of MIN $\chi^2$ and MLE follows from the
identity $G_t^{-1} = F_t'$.

In the preceding discussion we assumed $H = m$. If $H < m$, we can still
invert (9.3.18) and obtain

$$
x_{tk}'\beta_k = G_k(P_{t1}, P_{t2}, \ldots, P_{tm}), \qquad t = 1, 2, \ldots, T \text{ and } k = 1, 2, \ldots, H, \tag{9.3.23}
$$

but the choice of function $G_k$ is not unique. Amemiya (1976b) has shown that
the MIN $\chi^2$ estimator is asymptotically efficient only when we choose the
correct function $G_k$ from the many possible ones. This fact diminishes the
usefulness of the method. For this case Amemiya (1977c) proposed the following
method, which always leads to an asymptotically efficient estimator.

Step 1. Use some $G_k$ in (9.3.23) and define

$$
\hat\mu_{tk} = G_k(\hat P_{t1}, \hat P_{t2}, \ldots, \hat P_{tm}).
$$

Step 2. In (9.3.17) replace $\bar F_i$ and $\bar f_i^k$ by $F_i$ and $f_i^k$ evaluated at $\hat\mu_{tk}$, and replace
$x_{ih}'\bar\beta_h$ by $\hat\mu_{th}$.

Step 3. Apply FGLS on (9.3.17) using $\Lambda_i$ evaluated at $\hat\mu_{tk}$.

We shall conclude this subsection by generalizing WSSR defined by (9.2.42)
and (9.2.45) to the multinomial case.

We shall not write the analog of (9.2.42) explicitly because it would require
cumbersome notation, although the idea is simple. Instead, we shall merely
point out that it is of the form $(y - X\tilde\beta)'\hat\Sigma^{-1}(y - X\tilde\beta)$ obtained from the
regression equation (9.3.20). It is asymptotically distributed as chi-square with
degrees of freedom equal to $mT$ minus the number of regression parameters.

Next, consider the generalization of (9.2.45). Define the vector $\tilde F_t = F_t(\tilde\beta)$.
Then, the analog of (9.2.45) is

$$
\begin{aligned}
\text{WSSR} &= \sum_{t=1}^T (\hat P_t - \tilde F_t)'\tilde\Lambda_t^{-1}(\hat P_t - \tilde F_t) \\
&= \sum_{t=1}^T n_t(\hat P_t - \tilde F_t)'[D(\tilde F_t)^{-1} + \tilde F_{t0}^{-1}ll'](\hat P_t - \tilde F_t) \\
&= \sum_{t=1}^T n_t \sum_{j=0}^m \frac{(\hat P_{tj} - \tilde F_{tj})^2}{\tilde F_{tj}}.
\end{aligned}
\tag{9.3.24}
$$

It is asymptotically equivalent to the analog of (9.2.42).
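
The equality of the quadratic form and the Pearson-type sum in (9.3.24) can be verified numerically. The sketch below takes the fitted probabilities as given inputs rather than computing them from an estimate of $\beta$; the function names and the data are assumptions.

```python
def wssr_pearson(n_t, P_hat, F):
    """Last line of (9.3.24): sum_t n_t sum_{j=0}^m (P_tj - F_tj)^2 / F_tj."""
    return sum(n * sum((p - f) ** 2 / f for p, f in zip(Pt, Ft))
               for n, Pt, Ft in zip(n_t, P_hat, F))

def wssr_quadratic(n_t, P_hat, F):
    """Second line of (9.3.24), using categories 1..m and the inverse
    D(F)^{-1} + F_0^{-1} l l' of (9.3.10)."""
    total = 0.0
    for n, Pt, Ft in zip(n_t, P_hat, F):
        d = [p - f for p, f in zip(Pt[1:], Ft[1:])]      # drop category 0
        diag = sum(di ** 2 / f for di, f in zip(d, Ft[1:]))
        total += n * (diag + sum(d) ** 2 / Ft[0])
    return total

# Assumed data: T = 2 cells, m + 1 = 3 categories (index 0 first).
n_t = [50, 80]
P_hat = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]
F_fit = [[0.4, 0.4, 0.2], [0.2, 0.5, 0.3]]
a = wssr_pearson(n_t, P_hat, F_fit)
b = wssr_quadratic(n_t, P_hat, F_fit)
```

The two values agree because the deviations for categories 1 through $m$ sum to minus the deviation for category 0.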


9.3.2 Ordered Models 


Multinomial QR models can be classified into ordered and unordered models. 
In this subsection we shall discuss ordered models, and in the remainder of 
Section 9.3 we shall discuss various types of unordered models. 

A general definition of the ordered model is 


DEFINITION 9.3.1. The ordered model is defined by

$$
P(y = j \mid x, \theta) = p(S_j)
$$

for some probability measure $p$ depending on $x$ and $\theta$ and a finite sequence of
successive intervals $\{S_j\}$ depending on $x$ and $\theta$ such that $\bigcup_j S_j = R$, the real line.


A model is unordered if it is not ordered. In other words, in the ordered 
model the values that y takes correspond to a partition of the real line, whereas 
in the unordered model they correspond either to a nonsuccessive partition of 
the real line or to a partition of a higher-dimensional Euclidean space. 

In most applications the ordered model takes the simpler form 


$$
\begin{gathered}
P(y = j \mid x, \alpha, \beta) = F(\alpha_{j+1} - x'\beta) - F(\alpha_j - x'\beta), \\
j = 0, 1, \ldots, m, \qquad \alpha_0 = -\infty, \quad \alpha_j \le \alpha_{j+1}, \quad \alpha_{m+1} = \infty,
\end{gathered}
\tag{9.3.25}
$$

for some distribution function $F$. If $F = \Phi$, (9.3.25) defines the ordered probit
model; and if $F = \Lambda$, it defines the ordered logit model. Pratt (1981) showed
that the log likelihood function of the model (9.3.25) based on observations
$(y_i, x_i)$, $i = 1, 2, \ldots, n$, on $(y, x)$ is globally concave if $f$, derivative of $F$, is
positive and $\log f$ is concave.

The model (9.3.25) is motivated by considering an unobserved continuous 
random variable y* that determines the outcome of y by the rule 


$$
y = j \quad \text{if and only if} \quad \alpha_j < y^* < \alpha_{j+1}, \qquad j = 0, 1, \ldots, m. \tag{9.3.26}
$$

If the distribution function of $y^* - x'\beta$ is $F$, (9.3.26) implies (9.3.25).
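
The probabilities in (9.3.25) are differences of the distribution function evaluated at successive thresholds. The following sketch computes them for the ordered logit case; the function names and the threshold values are assumptions made for illustration.

```python
import math

def logistic_cdf(z):
    """Logistic distribution function, extended to +/- infinity."""
    if z == math.inf:
        return 1.0
    if z == -math.inf:
        return 0.0
    return 1.0 / (1.0 + math.exp(-z))

def ordered_probs(xb, cutoffs, F=logistic_cdf):
    """P(y = j) = F(a_{j+1} - x'b) - F(a_j - x'b), j = 0, ..., m.

    cutoffs holds the finite thresholds a_1 <= ... <= a_m; the end
    points a_0 = -inf and a_{m+1} = +inf are added here.
    """
    a = [-math.inf] + list(cutoffs) + [math.inf]
    return [F(a[j + 1] - xb) - F(a[j] - xb) for j in range(len(a) - 1)]

probs = ordered_probs(0.0, [-1.0, 1.0])  # three ordered categories
```

Passing a standard normal distribution function as `F` would give the ordered probit probabilities instead.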

In empirical applications of the ordered model, y* corresponds to a certain 
interpretative concept. For example, in the study of the effect of an insecticide
by Gurland, Lee, and Dahm (1960), y* signifies the tolerance of an insect 
against the insecticide. Depending on the value of y*, y takes three discrete 
values corresponding to the three states of an insect — dead, moribund, and 
alive. In the study by David and Legg (1975), y* is the unobserved price of a 
house, and the observed values of y correspond to various ranges of the price of
a house. In the study by Silberman and Talley (1974), y* signifies the excess 
demand for banking in a particular location and y the number of chartered 
bank offices in the location. See also Example 9.4.1 in Section 9.4.1. 

The use of the ordered model is less common in econometric applications 
than in biometric applications. This must be due to the fact that economic 
phenomena are complex and difficult to explain in terms of only a single 
unobserved index variable. We should be cautious in using an ordered model 
because if the true model is unordered, an ordered model can lead to serious 
biases in the estimation of the probabilities. On the other hand, the cost of 
using an unordered model when the true model is ordered is a loss of efficiency 
rather than consistency. 

We shall conclude this subsection by giving an econometric example of an 
ordered model, which is also an interesting application of the MIN x? method 
discussed in Section 9.3.1. 


EXAMPLE 9.3.1 (Deacon and Shapiro, 1975). In this article Deacon and 
Shapiro analyzed the voting behavior of Californians in two recent referenda: 
Rapid Transit Initiative (November 1970) and Coastal Zone Conservation 
Act (November 1972). We shall take up only the former. Let $\Delta U_i$ be the
difference between the utilities resulting from rapid transit and no rapid
transit for the $i$th individual. Deacon and Shapiro assumed that $\Delta U_i$ is distributed
logistically with mean $\mu_i$, that is, $P(\Delta U_i < x) = \Lambda(x - \mu_i)$, and that
the individual vote is determined by the rule:

$$
\begin{aligned}
&\text{Vote yes if } \Delta U_i > \delta_i \\
&\text{Vote no if } \Delta U_i < -\delta_i \\
&\text{Abstain otherwise.}
\end{aligned}
\tag{9.3.27}
$$


Writing $P_i(Y)$ and $P_i(N)$ for the probabilities that the $i$th individual votes yes
and no, respectively, we have

$$
P_i(Y) = \Lambda(\mu_i - \delta_i) \tag{9.3.28}
$$

and

$$
P_i(N) = \Lambda(-\mu_i - \delta_i). \tag{9.3.29}
$$

Deacon and Shapiro assumed $\mu_i = x_i'\beta_1$ and $\delta_i = x_i'\beta_2$, where $x_i$ is a vector of
independent variables and some elements of $\beta_1$ and $\beta_2$ are a priori specified to
be zeros. (Note that if $\delta_i = 0$, the model becomes a univariate binary logit
model.)

The model (9.3.27) could be estimated by MLE if the individual votes were
recorded and $x_i$ were observable. But, obviously, they are not: We only know
the proportion of yes votes and no votes in districts and observed average
values of $x_i$ or their proxies in each district. Thus we are forced to use a method
suitable for the case of many observations per cell.? Deacon and Shapiro used 
data on 334 California cities. For this analysis it is necessary to invoke the
assumption that $x_i = x_t$ for all $i \in I_t$, where $I_t$ is the set of individuals living in
the $t$th city. Then we obtain from (9.3.28) and (9.3.29)

$$
\log \frac{P_t(Y)}{1 - P_t(Y)} = x_t'(\beta_1 - \beta_2) \tag{9.3.30}
$$

and

$$
\log \frac{P_t(N)}{1 - P_t(N)} = -x_t'(\beta_1 + \beta_2). \tag{9.3.31}
$$

Let $\hat P_t(Y)$ and $\hat P_t(N)$ be the proportion of yes and no votes in the $t$th city. Then,
expanding the left-hand side of (9.3.30) and (9.3.31) by Taylor series around
$P_t(Y)$ and $P_t(N)$, respectively, we obtain the approximate regression equations

$$
\log \frac{\hat P_t(Y)}{1 - \hat P_t(Y)} \cong x_t'(\beta_1 - \beta_2) + \frac{1}{P_t(Y)[1 - P_t(Y)]}[\hat P_t(Y) - P_t(Y)] \tag{9.3.32}
$$

and

$$
\log \frac{\hat P_t(N)}{1 - \hat P_t(N)} \cong -x_t'(\beta_1 + \beta_2) + \frac{1}{P_t(N)[1 - P_t(N)]}[\hat P_t(N) - P_t(N)]. \tag{9.3.33}
$$


Note that (9.3.32) and (9.3.33) constitute a special case of (9.3.20). The error
terms of these two equations are heteroscedastic and, moreover, correlated
with each other. The covariance between the error terms can be obtained from
the result $\text{Cov}[\hat P_t(Y), \hat P_t(N)] = -n_t^{-1}P_t(Y)P_t(N)$. The MIN $\chi^2$ estimates of
$(\beta_1 - \beta_2)$ and $-(\beta_1 + \beta_2)$ are obtained by applying generalized least squares to
(9.3.32) and (9.3.33), taking into account both heteroscedasticity and the
correlation.? 
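
The weighting matrix needed for this generalized least squares step follows from the multinomial variances $n_t^{-1}P_t(1 - P_t)$ and the covariance stated above, each divided by the scale factors appearing in (9.3.32) and (9.3.33). The sketch below writes out the resulting $2 \times 2$ covariance matrix of the two error terms for one city. It is our own illustration under those formulas, not Deacon and Shapiro's code, and in practice the unknown probabilities would be replaced by the observed proportions.

```python
def error_covariance(p_yes, p_no, n_t):
    """2x2 covariance matrix of the error terms in (9.3.32) and (9.3.33).

    Var[e_Y] = 1 / (n_t P_Y (1 - P_Y)), and similarly for e_N;
    Cov[e_Y, e_N] = -1 / (n_t (1 - P_Y)(1 - P_N)).
    """
    var_y = 1.0 / (n_t * p_yes * (1.0 - p_yes))
    var_n = 1.0 / (n_t * p_no * (1.0 - p_no))
    cov = -1.0 / (n_t * (1.0 - p_yes) * (1.0 - p_no))
    return [[var_y, cov], [cov, var_n]]

# Assumed values for one city: 40% yes, 50% no, 10% abstaining, n_t = 100.
V = error_covariance(0.4, 0.5, 100)
```

The negative off-diagonal element reflects the fact that a larger share of yes votes leaves a smaller share available for no votes.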


9.3.3 Multinomial Logit Model 


In this and the subsequent subsections we shall present various types of unor- 
dered multinomial QR models. The multinomial logit model is defined by 


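$$
P_{ij} = \left[\sum_{k=0}^{m_i} \exp(x_{ik}'\beta)\right]^{-1} \exp(x_{ij}'\beta), \qquad i = 1, 2, \ldots, n \text{ and } j = 0, 1, \ldots, m_i, \tag{9.3.34}
$$

where we can assume $x_{i0} = 0$ without loss of generality. The log likelihood
function is given by

$$
\log L = \sum_{i=1}^n \sum_{j=0}^{m_i} y_{ij} \log P_{ij}. \tag{9.3.35}
$$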
Following McFadden (1974), we shall show the global concavity of (9.3.35). 
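Differentiating (9.3.35) with respect to $\beta$, we obtain

$$
\frac{\partial \log L}{\partial \beta} = \sum_i \sum_j \frac{y_{ij}}{P_{ij}} \frac{\partial P_{ij}}{\partial \beta}, \tag{9.3.36}
$$

where $\sum_i$ and $\sum_j$ denote $\sum_{i=1}^n$ and $\sum_{j=0}^{m_i}$, respectively. Differentiating (9.3.36)
further yields

$$
\frac{\partial^2 \log L}{\partial \beta \partial \beta'} = \sum_i \sum_j \frac{y_{ij}}{P_{ij}} \left[ \frac{\partial^2 P_{ij}}{\partial \beta \partial \beta'} - \frac{1}{P_{ij}} \frac{\partial P_{ij}}{\partial \beta} \frac{\partial P_{ij}}{\partial \beta'} \right]. \tag{9.3.37}
$$

Next, differentiating (9.3.34), we obtain after some manipulation

$$
\frac{\partial P_{ij}}{\partial \beta} = P_{ij}(x_{ij} - \bar x_i), \tag{9.3.38}
$$

where $\bar x_i = [\sum_k \exp(x_{ik}'\beta)]^{-1} \sum_k \exp(x_{ik}'\beta)x_{ik}$, and

$$
\frac{\partial^2 P_{ij}}{\partial \beta \partial \beta'} = P_{ij}(x_{ij} - \bar x_i)(x_{ij} - \bar x_i)' - P_{ij}\left[\sum_k \exp(x_{ik}'\beta)\right]^{-1} \sum_k \exp(x_{ik}'\beta)x_{ik}x_{ik}' + P_{ij}\bar x_i\bar x_i'. \tag{9.3.39}
$$

Inserting (9.3.38) and (9.3.39) into (9.3.37) yields

$$
\frac{\partial^2 \log L}{\partial \beta \partial \beta'} = -\sum_i \sum_j P_{ij}(x_{ij} - \bar x_i)(x_{ij} - \bar x_i)', \tag{9.3.40}
$$

which, interestingly, does not depend on $y_{ij}$. Because $P_{ij} > 0$ in this model, the
matrix (9.3.40) is negative definite unless $(x_{ij} - \bar x_i)'\alpha = 0$ for every $i$ and $j$ for
some $\alpha \neq 0$. Because such an event is extremely unlikely, we can conclude for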
all practical purposes that the log likelihood function is globally concave in the 
multinomial logit model. 
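
The probabilities (9.3.34) and the Hessian (9.3.40) can be computed directly from the attribute vectors. The sketch below is a minimal illustration with assumed data; the function names are hypothetical, and the subtraction of the maximum index inside `mnl_probs` is a standard numerical safeguard that does not change the probabilities.

```python
import math

def mnl_probs(X, beta):
    """P_ij of (9.3.34) for one individual; X[j] is the vector x_ij."""
    v = [sum(a * b for a, b in zip(x, beta)) for x in X]
    vmax = max(v)  # subtract the maximum for numerical stability
    e = [math.exp(vj - vmax) for vj in v]
    s = sum(e)
    return [ej / s for ej in e]

def mnl_hessian(data, beta):
    """Hessian (9.3.40): -sum_i sum_j P_ij (x_ij - xbar_i)(x_ij - xbar_i)'."""
    k = len(beta)
    H = [[0.0] * k for _ in range(k)]
    for X in data:
        P = mnl_probs(X, beta)
        xbar = [sum(p * x[a] for p, x in zip(P, X)) for a in range(k)]
        for p, x in zip(P, X):
            d = [x[a] - xbar[a] for a in range(k)]
            for a in range(k):
                for b in range(k):
                    H[a][b] -= p * d[a] * d[b]
    return H

# Assumed data: two individuals, three alternatives each, x_i0 = 0.
data = [[(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
        [(0.0, 0.0), (2.0, 1.0), (1.0, 3.0)]]
H = mnl_hessian(data, (0.3, -0.2))
```

Note that `mnl_hessian` takes no argument for the observed choices, which mirrors the remark that (9.3.40) does not depend on $y_{ij}$. For a $2 \times 2$ matrix, negative diagonal elements together with a positive determinant indicate negative definiteness.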

We shall now discuss an important result of McFadden (1974), which shows 
how the multinomial logit model can be derived from utility maximization. 
Consider for simplicity an individual i whose utilities associated with three 
alternatives are given by

$$
U_{ij} = \mu_{ij} + \epsilon_{ij}, \qquad j = 0, 1, \text{ and } 2, \tag{9.3.41}
$$

where $\mu_{ij}$ is a nonstochastic function of explanatory variables and unknown
parameters and $\epsilon_{ij}$ is an unobservable random variable. (In the following
discussion, we shall write $\epsilon_j$ for $\epsilon_{ij}$ to simplify the notation.) Thus (9.3.41) is
analogous to (9.2.4) and (9.2.5). As in Example 9.2.2, it is assumed that the
individual chooses the alternative for which the associated utility is highest.
McFadden proved that the multinomial logit model is derived from utility
maximization if and only if $\{\epsilon_j\}$ are independent and the distribution function
of $\epsilon_j$ is given by $\exp[-\exp(-\epsilon_j)]$. This is called the Type I extreme-value
distribution, or log Weibull distribution, by Johnson and Kotz (1970, p. 272),
who have given many more results about the distribution than are given here.
Its density is given by $\exp(-\epsilon_j)\exp[-\exp(-\epsilon_j)]$, which has a unique mode at 0
and a mean of approximately 0.577. We shall give only a proof of the if part.

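Denoting the density given in the preceding paragraph by $f(\cdot)$, we can write
the probability the $i$th person chooses alternative $j$ as (suppressing the subscript
$i$ from $\mu_{ij}$ as well as from $\epsilon_{ij}$)

$$
\begin{aligned}
P(y_i = 2) &= P(U_{i2} > U_{i1},\ U_{i2} > U_{i0}) \\
&= P(\epsilon_2 + \mu_2 - \mu_1 > \epsilon_1,\ \epsilon_2 + \mu_2 - \mu_0 > \epsilon_0) \\
&= \int_{-\infty}^{\infty} f(\epsilon_2)\left[\int_{-\infty}^{\epsilon_2 + \mu_2 - \mu_1} f(\epsilon_1)\,d\epsilon_1 \cdot \int_{-\infty}^{\epsilon_2 + \mu_2 - \mu_0} f(\epsilon_0)\,d\epsilon_0\right] d\epsilon_2 \\
&= \int_{-\infty}^{\infty} \exp(-\epsilon_2)\exp[-\exp(-\epsilon_2)] \\
&\qquad \times \exp[-\exp(-\epsilon_2 - \mu_2 + \mu_1)] \\
&\qquad \times \exp[-\exp(-\epsilon_2 - \mu_2 + \mu_0)]\,d\epsilon_2 \\
&= \frac{\exp(\mu_{i2})}{\exp(\mu_{i0}) + \exp(\mu_{i1}) + \exp(\mu_{i2})}.
\end{aligned}
\tag{9.3.42}
$$

Expression (9.3.42) is equal to $P_{i2}$ given in (9.3.34) if we put $\mu_{i2} - \mu_{i0} = x_{i2}'\beta$
and $\mu_{i1} - \mu_{i0} = x_{i1}'\beta$. The expressions for $P_{i0}$ and $P_{i1}$ can be similarly derived.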
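
The result (9.3.42) can be illustrated by simulation: draw independent Type I extreme-value errors, record which alternative has the highest utility, and compare the observed shares with the logit formula. The sketch below is an illustration only, with assumed values of $\mu_j$, an arbitrary seed, and hypothetical function names.

```python
import math
import random

def simulate_choice_shares(mu, n_draws=20000, seed=0):
    """Share of draws in which each alternative has the highest utility
    U_j = mu_j + e_j, with e_j i.i.d. Type I extreme value."""
    rng = random.Random(seed)
    counts = [0] * len(mu)
    for _ in range(n_draws):
        # If E ~ Exp(1), then -log(E) has distribution function exp[-exp(-e)].
        utils = [m - math.log(rng.expovariate(1.0)) for m in mu]
        counts[utils.index(max(utils))] += 1
    return [c / n_draws for c in counts]

def logit_probs(mu):
    """exp(mu_j) / sum_k exp(mu_k), as in the last line of (9.3.42)."""
    s = sum(math.exp(m) for m in mu)
    return [math.exp(m) / s for m in mu]

mu = [0.0, 0.5, 1.0]
shares = simulate_choice_shares(mu)
exact = logit_probs(mu)
```

With 20,000 draws the simulated shares should differ from the logit probabilities only by sampling error of order 0.004.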


EXAMPLE 9.3.2. As an application of the multinomial logit model, consider
the following hypothetical model of transport modal choice. We assume that
the utilities associated with three alternatives, car, bus, and train (corresponding
to the subscripts 0, 1, and 2, respectively), are given by (9.3.41). As
in (9.2.4) and (9.2.5), we assume

$$
\mu_{ij} = \alpha + \beta'z_{ij} + \gamma'w_i, \tag{9.3.43}
$$

where $z_{ij}$ is a vector of the mode characteristics and $w_i$ is a vector of the $i$th
person's socioeconomic characteristics. It is assumed that $\alpha$, $\beta$, and $\gamma$ are
constant for all $i$ and $j$. Then we obtain the multinomial logit model (9.3.34)
with $m = 2$ if we put $x_{i2} = z_{i2} - z_{i0}$ and $x_{i1} = z_{i1} - z_{i0}$.


The fact that $\beta$ is constant for all the modes makes this model useful in
predicting the demand for a certain new mode that comes into existence.
Suppose that an estimate $\hat\beta$ of $\beta$ has been obtained in the model with three
modes (Example 9.3.2) and that the characteristics $z_{i3}$ of a new mode (designated
by subscript 3) have been ascertained from engineering calculations and
a sample survey. Then the probability that the $i$th person will use the new
mode (assuming that the new mode is accessible to the person) can be estimated
by

$$
\hat P_{i3} = \frac{\exp(x_{i3}'\hat\beta)}{1 + \exp(x_{i1}'\hat\beta) + \exp(x_{i2}'\hat\beta) + \exp(x_{i3}'\hat\beta)}, \tag{9.3.44}
$$

where $x_{i3} = z_{i3} - z_{i0}$.

We should point out a restrictive property of the multinomial logit model:
The assumption of independence of $\{\epsilon_j\}$ implies that the alternatives are
dissimilar. Using McFadden's famous example, suppose that the three alternatives
in Example 9.3.2 consist of car, red bus, and blue bus, instead of car,
bus, and train. In such a case, the independence between $\epsilon_1$ and $\epsilon_2$ is a clearly
unreasonable assumption because a high (low) utility for red bus should
generally imply a high (low) utility for a blue bus. The probability
$P_0 = P(U_0 > U_1, U_0 > U_2)$ calculated under the independence assumption would
underestimate the true probability in this case because the assumption ignores
the fact that the event $U_0 > U_1$ makes the event $U_0 > U_2$ more likely.

Alternatively, note that in the multinomial logit model the relative probabilities
between a pair of alternatives are specified ignoring the third alternative.
For example, the relative probabilities between car and red bus are
specified the same way regardless of whether the third alternative is blue bus
or train. Mathematically, this fact is demonstrated by noting that (9.3.34)
implies

$$
P(y_i = j \mid y_i = j \text{ or } k) = [\exp(x_{ij}'\beta) + \exp(x_{ik}'\beta)]^{-1}\exp(x_{ij}'\beta). \tag{9.3.45}
$$


McFadden has called this characteristic of the model independence from 
irrelevant alternatives (IIA). 
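
The IIA property can be seen in a small numerical example: the ratio of the car and red bus probabilities is unchanged when the third alternative is replaced. The index values below are assumed purely for illustration.

```python
import math

def logit_probs(v):
    """Multinomial logit probabilities for a list of index values x_ij'beta."""
    s = sum(math.exp(x) for x in v)
    return [math.exp(x) / s for x in v]

v_car, v_red = 1.0, 0.4
with_blue = logit_probs([v_car, v_red, 0.4])    # third alternative: blue bus
with_train = logit_probs([v_car, v_red, -0.3])  # third alternative: train

ratio_blue = with_blue[0] / with_blue[1]
ratio_train = with_train[0] / with_train[1]

# Conditional probability of red bus given 'red bus or car', as in (9.3.45)
cond_red = with_blue[1] / (with_blue[0] + with_blue[1])
```

Both ratios equal $\exp(1.0 - 0.4)$, and the conditional probability depends only on the two index values involved.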
The following is another example of the model. 


EXAMPLE 9.3.3 (McFadden, 1976b). McFadden (1976b) used a multino- 
mial logit model to analyze the selection of highway routes by the California 
Division of Highways in the San Francisco and Los Angeles Districts during 
the years 1958-1966. The $i$th project among $n = 65$ projects chooses one
from $m_i$ routes and the selection probability is hypothesized precisely as
(9.3.34), where $x_{ij}$ is interpreted as a vector of the attributes of route $j$ in
project $i$.

There is a subtle conceptual difference between this model and the model of 
Example 9.3.2. In the latter model, j signifies a certain common type of 
transport mode for all the individuals i. For example, j = 0 means car for all i. 
In the McFadden model, the $j$th route of the first project and the $j$th route of
the second project have nothing substantial in common except that both are
number j routes. However, this difference is not essential because in this type 
of model each alternative is completely characterized by its characteristics 
vector x, and a common name such as car is just as meaningless as a number j 
in the operation of the model. 

McFadden tested the IIA hypothesis by reestimating one of his models 
using the choice set that consists of the chosen route and one additional route 
randomly selected from $m_i$. The idea is that if this hypothesis is true, estimates
obtained from a full set of alternatives should be close to estimates obtained by 
randomly eliminating some nonchosen alternatives. For each coefficient the 
difference between the two estimates was found to be less than its standard 
deviation, a finding indicating that the hypothesis is likely to be accepted. 
However, to be exact, we must test the equality of all the coefficients simulta- 
neously. 


Such a test, an application of Hausman's test (see Section 4.5.1), is devel- 
oped with examples in the article by Hausman and McFadden (1984). They 
tested the IIA hypothesis in a trichotomous logit model for which the three 
alternatives are owning an electric dryer, a gas dryer, or no dryer. In one 
experiment, data on the households without a dryer were discarded to obtain a 
consistent but inefficient estimator, and in the other experiment, data on 
those owning electric dryers were discarded. In both experiments Hausman's 
test rejected the IIA hypothesis at less than 1% significance level. Alternative
tests of the IIA hypothesis will be discussed in Section 9.3.5. 


9.3.4 Multinomial Discriminant Analysis 


The DA model of Section 9.2.8 can be generalized to yield a multinomial DA 
model defined by 


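$$
x_i^* \mid (y_i = j) \sim N(\mu_j, \Sigma_j) \tag{9.3.46}
$$

and

$$
P(y_i = j) = q_j \tag{9.3.47}
$$

for $i = 1, 2, \ldots, n$ and $j = 0, 1, \ldots, m$. By Bayes's rule we obtain

$$
P(y_i = j \mid x_i^*) = \frac{g_j(x_i^*)q_j}{\sum_{k=0}^m g_k(x_i^*)q_k}, \tag{9.3.48}
$$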


where $g_j$ is the density function of $N(\mu_j, \Sigma_j)$. Just as we obtained (9.2.48) from
(9.2.46), we can obtain from (9.3.48)

$$
\frac{P(y_i = j \mid x_i^*)}{P(y_i = 0 \mid x_i^*)} = \exp(\beta_{j(1)} + \beta_{j(2)}'x_i^* + x_i^{*\prime}Ax_i^*), \tag{9.3.49}
$$

where $\beta_{j(1)}$, $\beta_{j(2)}$, and $A$ are similar to (9.2.49), (9.2.50), and (9.2.51) except that
the subscripts 1 and 0 should be changed to $j$ and 0, respectively.

As before, the term $x_i^{*\prime}Ax_i^*$ drops out if all the $\Sigma$'s are identical. If we write
$\beta_{j(1)} + \beta_{j(2)}'x_i^* = \beta_j'x_i$, the DA model with identical variances can be written
exactly in the form of (9.3.34), except for a modification of the subscripts of $\beta$
and $x$.
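
The reduction of the DA model with identical variances to the multinomial logit form can be checked numerically in the scalar case. The sketch below assumes a single regressor with common variance; the intercept and slope are obtained by taking the log of the ratio of the terms in (9.3.48) for this special case. They are our own specialization, not a quotation of (9.2.49) and (9.2.50), and all function names and parameter values are hypothetical.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_bayes(x, mus, var, qs):
    """P(y = j | x) from Bayes's rule (9.3.48) with a common variance."""
    w = [normal_pdf(x, mu, var) * q for mu, q in zip(mus, qs)]
    s = sum(w)
    return [wj / s for wj in w]

def posterior_logit(x, mus, var, qs):
    """The same probabilities written as a multinomial logit in (1, x)."""
    idx = []
    for mu, q in zip(mus, qs):
        # intercept and slope relative to the base category j = 0
        b1 = math.log(q / qs[0]) - (mu ** 2 - mus[0] ** 2) / (2.0 * var)
        b2 = (mu - mus[0]) / var
        idx.append(b1 + b2 * x)
    s = sum(math.exp(v) for v in idx)
    return [math.exp(v) / s for v in idx]

mus, var, qs = [0.0, 1.0, 2.5], 1.5, [0.5, 0.3, 0.2]
p_bayes = posterior_bayes(0.7, mus, var, qs)
p_logit = posterior_logit(0.7, mus, var, qs)
```

The two sets of probabilities coincide at every $x$, which is the scalar counterpart of the statement that the quadratic term drops out when the variances are identical.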

Examples of multinomial DA models are found in articles by Powers et al. 
(1978) and Uhler (1968), both of which are summarized by Amemiya (1981). 


9.3.5 Nested Logit Model 


In Section 9.3.3 we defined the multinomial logit model and pointed out its 
weakness when some of the alternatives are similar. In this section we shall 
discuss the nested (or nonindependent) logit model that alleviates that weak- 
ness to a certain extent. This model is attributed to McFadden (1977) and is 
developed in greater detail in a later article by McFadden (1981). We shall 
analyze a trichotomous model in detail and then generalize the results ob- 
tained for this trichotomous model to a general multinomial case. 

Let us consider the red bus—blue bus model once more for the purpose of 
illustration. Let $U_j = \mu_j + \epsilon_j$, $j = 0$, 1, and 2, be the utilities associated with
car, red bus, and blue bus. (To avoid unnecessary complication in notation,
we have suppressed the subscript $i$.) We pointed out earlier that it is unreasonable
to assume independence between $\epsilon_1$ and $\epsilon_2$, although $\epsilon_0$ may be assumed
independent of the other two. McFadden suggested the following bivariate
distribution as a convenient way to take account of a correlation between $\epsilon_1$
and $\epsilon_2$:

$$
F(\epsilon_1, \epsilon_2) = \exp\{-[\exp(-\rho^{-1}\epsilon_1) + \exp(-\rho^{-1}\epsilon_2)]^\rho\}, \qquad 0 < \rho \le 1. \tag{9.3.50}
$$

Johnson and Kotz (1972, p. 256) called this distribution Gumbel's Type B
bivariate extreme-value distribution. The correlation coefficient can be shown
to be $1 - \rho^2$. If $\rho = 1$ (the case of independence), $F(\epsilon_1, \epsilon_2)$ becomes the product
of two Type I extreme-value distributions; in other words, we recover the
multinomial logit model. As for $\epsilon_0$, we assume $F(\epsilon_0) = \exp[-\exp(-\epsilon_0)]$ as in the
multinomial logit model.
Under these assumptions we can show 


2ma exp (Lp) 
Ply = 0) = ap (ue) + [exp (p) + exp (p HDP 0351) 


and 


_ _ exp (p^ y) 
POs Ny tO) = exp (p^) + exp (97!u3) 0332 


The other probabilities can be deduced from (9.3.51) and (9.3.52). Therefore 
these two equations define a nested logit model in the trichotomous case. By 
dividing the numerator and the denominator of (9.3.51) by $\exp(\mu_0)$ and those 
of (9.3.52) by $\exp(\rho^{-1}\mu_0)$, we note that the probabilities depend on $\mu_2 - \mu_0$, 
$\mu_1 - \mu_0$, and $\rho$. We would normally specify $\mu_j = x_j'\beta$, $j = 0, 1, 2$. The estimation 
of $\beta$ and $\rho$ will be discussed for more general nested logit models later. 

The form of these two probabilities is intuitively attractive. Equation 
(9.3.52) shows that the choice between the two similar alternatives is made 
according to a binary logit model, whereas (9.3.51) suggests that the choice 
between car and noncar is also like a logit model except that a certain kind of a 
weighted average of $\exp(\mu_1)$ and $\exp(\mu_2)$ is used. 
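
The two probabilities can be evaluated numerically. The following minimal Python sketch implements (9.3.51) and (9.3.52) for the car, red bus, and blue bus example; the function names and the numerical values are our own illustrative assumptions. It also shows that $\rho = 1$ reproduces the multinomial logit probabilities and that, as $\rho$ approaches 0, the probability of choosing the car approaches 1/2 when all the $\mu_j$ are equal.

```python
import math

def nested_logit_trichotomous(mu0, mu1, mu2, rho):
    """Return (P0, P1, P2) for the trichotomous nested logit model.

    P0 follows (9.3.51); P(y = 1 | y != 0) follows (9.3.52).
    rho in (0, 1] is the parameter of the bivariate distribution (9.3.50).
    """
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    # Sum over the two similar alternatives (red bus, blue bus)
    nest_sum = math.exp(mu1 / rho) + math.exp(mu2 / rho)
    inclusive = nest_sum ** rho                      # "weighted average" term
    p0 = math.exp(mu0) / (math.exp(mu0) + inclusive)  # (9.3.51)
    p1_given_bus = math.exp(mu1 / rho) / nest_sum     # (9.3.52)
    p1 = (1.0 - p0) * p1_given_bus
    p2 = (1.0 - p0) * (1.0 - p1_given_bus)
    return p0, p1, p2

def multinomial_logit(mus):
    """Multinomial logit probabilities, used here only for comparison."""
    denom = sum(math.exp(m) for m in mus)
    return tuple(math.exp(m) / denom for m in mus)

if __name__ == "__main__":
    print(nested_logit_trichotomous(0.0, 0.0, 0.0, 1.0))   # equal shares
    print(nested_logit_trichotomous(0.0, 0.0, 0.0, 0.05))  # car share near 1/2
```

With equal $\mu_j$, the case $\rho = 1$ gives each alternative probability 1/3, whereas a small $\rho$ treats the two buses almost as a single alternative.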

To obtain (9.3.51), note that 


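$$
\begin{aligned}
P(y = 0) &= P(U_0 > U_1,\ U_0 > U_2) \\
&= P(\mu_0 + \epsilon_0 > \mu_1 + \epsilon_1,\ \mu_0 + \epsilon_0 > \mu_2 + \epsilon_2) \\
&= \int_{-\infty}^{\infty} \left\{ \int_{-\infty}^{\epsilon_0 + \mu_0 - \mu_1} \left[ \int_{-\infty}^{\epsilon_0 + \mu_0 - \mu_2} \exp(-\epsilon_0) \exp[-\exp(-\epsilon_0)]\, f(\epsilon_1, \epsilon_2)\, d\epsilon_2 \right] d\epsilon_1 \right\} d\epsilon_0 \\
&= \int_{-\infty}^{\infty} \exp(-\epsilon_0) \exp[-\exp(-\epsilon_0)] \\
&\qquad \times \exp\left(-\{\exp[-\rho^{-1}(\epsilon_0 + \mu_0 - \mu_1)] + \exp[-\rho^{-1}(\epsilon_0 + \mu_0 - \mu_2)]\}^{\rho}\right) d\epsilon_0 \\
&= \int_{-\infty}^{\infty} \exp(-\epsilon_0) \exp[-\alpha \exp(-\epsilon_0)]\, d\epsilon_0 \\
&= \alpha^{-1},
\end{aligned}
\tag{9.3.53}
$$


where $\alpha = 1 + \exp(-\mu_0)[\exp(\rho^{-1}\mu_1) + \exp(\rho^{-1}\mu_2)]^{\rho}$. 

To obtain (9.3.52), we first observe that 

$$
\begin{aligned}
P(y = 1 \mid y \neq 0) &= P(U_1 > U_2 \mid U_1 > U_0 \text{ or } U_2 > U_0) \\
&= P(U_1 > U_2),
\end{aligned}
\tag{9.3.54}
$$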


where the last equality follows from our particular distributional assumptions. 
Next we obtain 


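$$
\begin{aligned}
P(U_1 > U_2) &= P(\mu_1 + \epsilon_1 > \mu_2 + \epsilon_2) \\
&= \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\epsilon_1 + \mu_1 - \mu_2} f(\epsilon_1, \epsilon_2)\, d\epsilon_2 \right] d\epsilon_1 \\
&= \int_{-\infty}^{\infty} \exp(-\epsilon_1) \{1 + \exp[-\rho^{-1}(\mu_1 - \mu_2)]\}^{\rho - 1} \\
&\qquad \times \exp\left(-\exp(-\epsilon_1) \{1 + \exp[-\rho^{-1}(\mu_1 - \mu_2)]\}^{\rho}\right) d\epsilon_1 \\
&= \{1 + \exp[-\rho^{-1}(\mu_1 - \mu_2)]\}^{-1},
\end{aligned}
\tag{9.3.55}
$$

where in the third equality we used 

$$\frac{\partial F(\epsilon_1, \epsilon_2)}{\partial \epsilon_1} = [\exp(-\rho^{-1}\epsilon_1) + \exp(-\rho^{-1}\epsilon_2)]^{\rho - 1} \exp(-\rho^{-1}\epsilon_1)\, F(\epsilon_1, \epsilon_2). \tag{9.3.56}$$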


The model defined by (9.3.51) and (9.3.52) is reduced to the multinomial 
logit model if p = 1. Therefore the IIA hypothesis, which is equivalent to the 
hypothesis p = 1, can be tested against the alternative hypothesis of the nested 
logit model by any one of the three asymptotic tests described in Section 4.5.1. 
Hausman and McFadden (1984) performed the three tests using the data on 
the households’ holdings of dryers, which we discussed in Section 9.3.3. The 
utilities of owning the two types of dryers were assumed to be correlated with 
each other, and the utility of owning no dryer was assumed to be independent 
of the other utilities. As did Hausman’s test, all three tests rejected the IIA 
hypothesis at less than 1% significance level. They also conducted a Monte 
Carlo analysis of Hausman’s test and the three asymptotic tests in a hypotheti- 
cal trichotomous nested logit model. Their findings were as follows: (1) Even 
with n = 1000, the observed frequency of rejecting the null hypothesis using 
the Wald and Rao tests differed greatly from the nominal size, whereas it was 
better for the LRT. (2) The power of the Wald test after correcting for the size 
was best, with Hausman’s test a close second. 

Next, we shall generalize the trichotomous nested logit model defined by 
(9.3.51) and (9.3.52) to the general case of $m + 1$ responses. Suppose that the 
$m + 1$ integers $0, 1, \ldots, m$ can be naturally partitioned into $S$ groups so 
that each group consists of similar alternatives. Write the partition as 

$$(0, 1, 2, \ldots, m) = B_1 \cup B_2 \cup \cdots \cup B_S, \tag{9.3.57}$$


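where $\cup$ denotes the union. Then McFadden suggested the joint distribution 


$$F(\epsilon_0, \epsilon_1, \ldots, \epsilon_m) = \exp\left\{-\sum_{s=1}^{S} a_s \left[\sum_{j \in B_s} \exp(-\rho_s^{-1}\epsilon_j)\right]^{\rho_s}\right\}. \tag{9.3.58}$$

Then it can be shown that 

$$P(y \in B_s) = \frac{a_s \left[\sum_{j \in B_s} \exp(\rho_s^{-1}\mu_j)\right]^{\rho_s}}{\sum_{t=1}^{S} a_t \left[\sum_{j \in B_t} \exp(\rho_t^{-1}\mu_j)\right]^{\rho_t}}, \qquad s = 1, 2, \ldots, S, \tag{9.3.59}$$

and 

$$P(y = j \mid j \in B_s) = \frac{\exp(\rho_s^{-1}\mu_j)}{\sum_{k \in B_s} \exp(\rho_s^{-1}\mu_k)}, \qquad s = 1, 2, \ldots, S. \tag{9.3.60}$$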


Note that (9.3.59) and (9.3.60) are generalizations of (9.3.51) and (9.3.52), 
respectively. Clearly, these probabilities define the model completely. As before, 
we can interpret 


$$\left[\sum_{j \in B_s} \exp(\rho_s^{-1}\mu_j)\right]^{\rho_s}$$


as a kind of weighted average of $\exp(\mu_j)$ for $j \in B_s$. 
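
As an illustration, the general two-level probabilities can be computed by multiplying the group probability (9.3.59) by the within-group probability (9.3.60). The Python sketch below is a minimal version under our own assumptions about data layout (a dictionary of $\mu_j$ values and a list of groups $B_s$); the names are hypothetical.

```python
import math

def nested_logit_probs(mu, nests, rho, a=None):
    """Unconditional choice probabilities of a two-level nested logit model.

    mu    : dict mapping each alternative j to mu_j
    nests : list of lists, the groups B_1, ..., B_S
    rho   : list of rho_s, one per group, each in (0, 1]
    a     : list of a_s (defaults to 1 for every group)
    """
    if a is None:
        a = [1.0] * len(nests)
    # Within-group sums: sum_{j in B_s} exp(mu_j / rho_s)
    sums = [sum(math.exp(mu[j] / r) for j in B) for B, r in zip(nests, rho)]
    # Group weights: a_s * [within-group sum]^{rho_s}
    weights = [a_s * s ** r for a_s, s, r in zip(a, sums, rho)]
    total = sum(weights)
    probs = {}
    for B, r, s, w in zip(nests, rho, sums, weights):
        p_group = w / total                        # (9.3.59)
        for j in B:
            p_within = math.exp(mu[j] / r) / s     # (9.3.60)
            probs[j] = p_group * p_within
    return probs

if __name__ == "__main__":
    mu = {0: 0.3, 1: -0.2, 2: 0.5}
    # Group {0} is the car; group {1, 2} holds the two similar alternatives.
    print(nested_logit_probs(mu, [[0], [1, 2]], [1.0, 0.5]))
```

A group with a single alternative and $\rho_s = 1$ contributes simply $a_s \exp(\mu_j)$, so the trichotomous model is recovered as a special case.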

The nested logit model defined by (9.3.59) and (9.3.60) can be estimated by 
MLE, but it also can be consistently estimated by a natural two-step method, 
which is computationally simpler. Suppose we specify $\mu_j = x_j'\beta$. First, the part 
of the likelihood function that is the product of the conditional probabilities of 
the form (9.3.60) is maximized to yield an estimate of $\rho_s^{-1}\beta$. Second, this 
estimate is inserted into the right-hand side of (9.3.59), and the product of 
(9.3.59) over $s$ and $i$ (which is suppressed) is maximized to yield estimates of 
the $\rho$'s and $a$'s (one of the $a$'s can be arbitrarily set). The asymptotic covariance 
matrix of these elements is given in McFadden (1981). 

In a large-scale nested logit model, the two-step method is especially useful 
because of the near infeasibility of MLE. Another merit of the two-step 
method is that it yields consistent estimates that can be used to start a 
Newton-Raphson iteration to compute MLE, as done by Hausman and 
McFadden. 

We shall present two applications of the nested logit model. 


EXAMPLE 9.3.4 (Small and Brownstone, 1982). Small and Brownstone ap- 
plied a nested logit model to analyze trip timing. The dependent variable takes 
12 values corresponding to different arrival times, and the authors experimented 
with various ways of “nesting” the 12 responses, for example, 
$B_1 = (1, 2, \ldots, 8)$ and $B_2 = (9, 10, 11, 12)$, or $B_1 = (1, 2, \ldots, 8)$, $B_2 = (9)$, and 
$B_3 = (10, 11, 12)$. All $a$'s were assumed to be equal to 1, and various specifications 
of the $\rho$'s were tried. Small and Brownstone found that the two-step 
estimator had much larger variances than the MLE and often yielded unrea- 
sonable values. Also, the computation of the asymptotic covariance matrix of 
the two-step estimator took as much time as the second-round estimator 
obtained in the Newton-Raphson iteration, even though the fully iterated 
Newton-Raphson iteration took six times as much time. 


EXAMPLE 9.3.5 (McFadden, 1978). A person chooses a community to live 
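in and a type of dwelling to live in. There are $S$ communities; integers in $B_s$ 
signify the types of dwellings available in community $s$. Set $a_s$ = a constant, 
$\rho_s = \rho$, a constant, and $\mu_{cd} = \beta' x_{cd} + \alpha' z_c$. Then 


$$P(\text{community } c \text{ is chosen}) = \frac{\left\{\sum_{d \in B_c} \exp[\rho^{-1}(\beta' x_{cd} + \alpha' z_c)]\right\}^{\rho}}{\sum_{c'} \left\{\sum_{d \in B_{c'}} \exp[\rho^{-1}(\beta' x_{c'd} + \alpha' z_{c'})]\right\}^{\rho}} \tag{9.3.61}$$

and 

$$P(\text{dwelling } d \text{ is chosen} \mid c \text{ is chosen}) = \frac{\exp(\rho^{-1}\beta' x_{cd})}{\sum_{d' \in B_c} \exp(\rho^{-1}\beta' x_{cd'})}. \tag{9.3.62}$$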


As in this example, a nested logit is useful when a set of utilities can be 
naturally classified into independent classes while nonzero correlation is 
allowed among utilities within each class. 


9.3.6 Higher-Level Nested Logit Model 


The nested logit model defined in the preceding section can be regarded as 
implying two levels of nesting because the responses are classified into S 
groups and each group is further classified into the individual elements. In this 
section we shall consider a three-level nested logit. A generalization to a higher 
level can be easily deduced from the three-level case. 

Figure 9.1 shows examples of two-level and three-level nested logit models 
for the case of eight responses. 

Following McFadden (1981), we can generalize (9.3.58) to the three-level 
case by defining 


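$$F(\epsilon_0, \epsilon_1, \ldots, \epsilon_m) = \exp\left(-\sum_u b_u \left\{\sum_{s \in C_u} a_s \left[\sum_{j \in B_s} \exp(-\rho_s^{-1}\epsilon_j)\right]^{\rho_s/\sigma_u}\right\}^{\sigma_u}\right). \tag{9.3.63}$$


Then (9.3.59) and (9.3.60) can be generalized to 


$$P(y = j \mid s) = \frac{\exp(\mu_j/\rho_s)}{\sum_{k \in B_s} \exp(\mu_k/\rho_s)}, \tag{9.3.64}$$

$$P(s \mid u) = \frac{a_s \left[\sum_{j \in B_s} \exp(\mu_j/\rho_s)\right]^{\rho_s/\sigma_u}}{\sum_{t \in C_u} a_t \left[\sum_{j \in B_t} \exp(\mu_j/\rho_t)\right]^{\rho_t/\sigma_u}}, \tag{9.3.65}$$

and 

$$P(u) = \frac{b_u \left\{\sum_{s \in C_u} a_s \left[\sum_{j \in B_s} \exp(\mu_j/\rho_s)\right]^{\rho_s/\sigma_u}\right\}^{\sigma_u}}{\sum_w b_w \left\{\sum_{t \in C_w} a_t \left[\sum_{j \in B_t} \exp(\mu_j/\rho_t)\right]^{\rho_t/\sigma_w}\right\}^{\sigma_w}}. \tag{9.3.66}$$

[Figure 9.1 Two-level and three-level nested logit models (panels: Two-Level, Three-Level)] 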


A three-step estimation method can be defined in which, first, the parame- 
ters in (9.3.64) are estimated by maximizing the product of the terms of the 
form (9.3.64). Then these estimates are inserted into (9.3.65), and the product 
of the terms of the form (9.3.65) is maximized to yield the estimates of its 
parameters. Finally, (9.3.66) is maximized after inserting the estimates ob- 
tained by the first and second steps. 


9.3.7 Generalized Extreme-Value Model 


McFadden (1978) introduced the generalized extreme-value (GEV) distribu- 
tion defined by 


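$$F(\epsilon_1, \epsilon_2, \ldots, \epsilon_m) = \exp\{-G[\exp(-\epsilon_1), \exp(-\epsilon_2), \ldots, \exp(-\epsilon_m)]\}, \tag{9.3.67}$$

where $G$ satisfies the conditions, 

(i) $G(u_1, u_2, \ldots, u_m) \ge 0$, $u_1, u_2, \ldots, u_m \ge 0$. 
(ii) $G(\alpha u_1, \alpha u_2, \ldots, \alpha u_m) = \alpha G(u_1, u_2, \ldots, u_m)$. 
(iii) $\dfrac{\partial^k G}{\partial u_{i_1} \partial u_{i_2} \cdots \partial u_{i_k}} \ge 0$ if $k$ is odd, 
$\le 0$ if $k$ is even, $k = 1, 2, \ldots, m$. 


If $U_j = \mu_j + \epsilon_j$ and the alternative with the highest utility is chosen as before, 
(9.3.67) implies the GEV model 


$$P_j = \frac{\exp(\mu_j)\, G_j[\exp(\mu_1), \exp(\mu_2), \ldots, \exp(\mu_m)]}{G[\exp(\mu_1), \exp(\mu_2), \ldots, \exp(\mu_m)]}, \tag{9.3.68}$$


where $G_j$ is the derivative of $G$ with respect to its $j$th argument. 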

Both the nested logit model and the higher-level nested logit model dis- 
cussed in the preceding sections are special cases of the GEV model. The only 
known application of the GEV model that is not a nested logit model is in a 
study by Small (1981). 

The multinomial models presented in the subsequent subsections do not 
belong to the class of GEV models. 
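
To see how the GEV formula (9.3.68) nests the earlier models, the following Python sketch evaluates $P_j = \exp(\mu_j) G_j / G$ with $G_j$ approximated by a central difference. The two generating functions shown (a simple sum, which yields the multinomial logit model, and a nested form with $a_s = 1$) are our own illustrative choices, as are all names.

```python
import math

def gev_probs(mu, G, h=1e-6):
    """Evaluate P_j = exp(mu_j) * G_j(u) / G(u) at u_j = exp(mu_j).

    G_j, the derivative of G with respect to its jth argument, is
    approximated numerically by a central difference with relative step h.
    """
    u = [math.exp(m) for m in mu]
    g = G(u)
    probs = []
    for j in range(len(u)):
        step = h * u[j]
        up = u[:j] + [u[j] + step] + u[j + 1:]
        dn = u[:j] + [u[j] - step] + u[j + 1:]
        G_j = (G(up) - G(dn)) / (2.0 * step)
        probs.append(u[j] * G_j / g)
    return probs

def G_logit(u):
    """G(u) = sum_j u_j gives the multinomial logit model."""
    return sum(u)

def make_G_nested(nests, rho):
    """G(u) = sum_s [sum_{j in B_s} u_j^(1/rho_s)]^(rho_s), with a_s = 1."""
    def G(u):
        return sum(sum(u[j] ** (1.0 / r) for j in B) ** r
                   for B, r in zip(nests, rho))
    return G

if __name__ == "__main__":
    mu = [0.3, -0.2, 0.5]
    print(gev_probs(mu, G_logit))
    print(gev_probs(mu, make_G_nested([[0], [1, 2]], [1.0, 0.5])))
```

Because $G$ is homogeneous of degree one, Euler's theorem guarantees that the probabilities computed this way sum to one, up to the error of the numerical derivative.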


9.3.8 Universal Logit Model 


In Section 9.2.1 we stated that for a binary QR model a given probability 
function $G(x_i^*, \theta)$ can be approximated by $F[H(x_i^*, \theta)]$ by choosing an appropriate 
$H(x_i^*, \theta)$ for a given $F$. When $F$ is chosen to be the logistic function $\Lambda$, 
such a model is called a universal logit model. A similar fact holds in the 
multinomial case as well. Consider the trichotomous case of Example 9.3.2. 
A universal logit model is defined by 


$$P_{i2} = \frac{\exp(g_{i2})}{1 + \exp(g_{i1}) + \exp(g_{i2})}, \tag{9.3.69}$$

$$P_{i1} = \frac{\exp(g_{i1})}{1 + \exp(g_{i1}) + \exp(g_{i2})}, \tag{9.3.70}$$

and 

$$P_{i0} = \frac{1}{1 + \exp(g_{i1}) + \exp(g_{i2})}, \tag{9.3.71}$$


where $g_{i1}$ and $g_{i2}$ are functions of all the explanatory variables of the model: 
$z_{i0}$, $z_{i1}$, $z_{i2}$, and $w_i$. Any arbitrary trichotomous model can be approximated by 
this model by choosing the functions $g_{i1}$ and $g_{i2}$ appropriately. As long as $g_{i1}$ 
and $g_{i2}$ depend on all the mode characteristics, the universal logit model does 
not satisfy the assumption of the independence from irrelevant alternatives. 
When the $g$’s are linear in the explanatory variables with coefficients that 
generally vary with the alternatives, the model is reduced to a multinomial 
logit model sometimes used in applications (see Cox, 1966, p. 65), which 
differs from the one defined in Section 9.3.3. 
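
The failure of the independence from irrelevant alternatives can be illustrated numerically. In the Python sketch below, the index functions `g1` and `g2` are purely hypothetical choices of ours that depend on the characteristics of all three modes; the odds of alternative 1 against alternative 0 then change when only the characteristic of alternative 2 changes.

```python
import math

def universal_logit(g1, g2):
    """Return (P0, P1, P2) from (9.3.69)-(9.3.71) given the two indices."""
    e1, e2 = math.exp(g1), math.exp(g2)
    denom = 1.0 + e1 + e2
    return 1.0 / denom, e1 / denom, e2 / denom

# Hypothetical index functions: each depends on ALL mode characteristics
# (z0, z1, z2) and on an individual characteristic w.
def g1(z0, z1, z2, w):
    return 0.5 * (z1 - z0) + 0.2 * w - 0.3 * z1 * z2

def g2(z0, z1, z2, w):
    return 0.5 * (z2 - z0) + 0.1 * w - 0.3 * z1 * z2

def odds_1_vs_0(z0, z1, z2, w):
    """Odds P1/P0, which equal exp(g1) in the universal logit model."""
    p0, p1, _ = universal_logit(g1(z0, z1, z2, w), g2(z0, z1, z2, w))
    return p1 / p0

if __name__ == "__main__":
    # Only z2 (the characteristic of alternative 2) differs between the calls.
    print(odds_1_vs_0(1.0, 2.0, 1.0, 0.0))
    print(odds_1_vs_0(1.0, 2.0, 3.0, 0.0))
```

If the cross term involving `z2` were removed from `g1`, the two printed odds would coincide, as they do in the multinomial logit model.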


9.3.9 Multinomial Probit Model 


Let $U_j$, $j = 0, 1, 2, \ldots, m$, be the stochastic utility associated with the $j$th 
alternative for a particular individual. By the multinomial probit model, we 
mean the model in which $\{U_j\}$ are jointly normally distributed. Such a model 
was first proposed by Aitchison and Bennett (1970). This model has rarely 
been used in practice until recently because of its computational difficulty 
(except, of course, when $m = 1$, in which case the model is reduced to a binary 
probit model, which is discussed in Section 9.2). To illustrate the complexity 
of the problem, consider the case of $m = 2$. Then, to evaluate $P(y = 2)$, for 
example, we must calculate the multiple integral 


$$
\begin{aligned}
P(y = 2) &= P(U_2 > U_1,\ U_2 > U_0) \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{U_2} \int_{-\infty}^{U_2} f(U_0, U_1, U_2)\, dU_0\, dU_1\, dU_2,
\end{aligned}
\tag{9.3.72}
$$


where f is a trivariate normal density. The direct computation of such an 
integral is involved even for the case of m = 3.? Moreover, (9.3.72) must be 
evaluated at each step of an iterative method to maximize the likelihood 
function. 

Hausman and Wise (1978) estimated a trichotomous probit model to ex- 
plain the modal choice among driving own car, sharing rides, and riding a bus, 
for 557 workers in Washington, D.C. They specified the utilities by 


X - 
Uy = Bi log xij + 85 log xy + Bis x tei, (9.3.73) 
uU 


where $x_1$, $x_2$, $x_3$, and $x_4$ represent in-vehicle time, out-of-vehicle time, income, 
and cost, respectively. They assume that $(\beta_{i1}, \beta_{i2}, \beta_{i3}, \epsilon_{ij})$ are independently 
(with each other) normally distributed with means $(\beta_1, \beta_2, \beta_3, 0)$ and 
variances $(\sigma_1^2, \sigma_2^2, \sigma_3^2, 1)$. It is assumed that the $\epsilon_{ij}$ are independent both through $i$ 
and $j$ and the $\beta$'s are independent through $i$; therefore correlation between $U_{ij}$ and 
$U_{ik}$ occurs because of the same $\beta$'s appearing for all the alternatives. Hausman 
and Wise evaluated integrals of the form (9.3.72) using series expansion and 
noted that the method is feasible for a model with up to five alternatives. 

Hausman and Wise also used two other models to analyze the same data: 
Multinomial Logit (derived from Eq. 9.3.73 by assuming that the $\beta$'s are nonstochastic 
and the $\epsilon_{ij}$ are independently distributed as Type I extreme-value distribution) 
and Independent Probit (derived from the model defined in the preceding 
paragraph by putting $\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = 0$). We shall call the original model 
Nonindependent Probit. The conclusions of Hausman and Wise are that (1) 
Logit and Independent Probit give similar results both in estimation and in 
the forecast of the probability of using a new mode; (2) Nonindependent 
Probit differs significantly from the other two models both in estimation and 
the forecast about the new mode; and (3) Nonindependent Probit fits best. 
The likelihood ratio test rejects Independent Probit in favor of Nonindependent 
Probit at the 7% significance level. These results appear promising for 
further development of the model. 

Albright, Lerman, and Manski (1977) (see also Lerman and Manski, 1981) 
developed a computer program for calculating the ML estimator (by a gradient 
iterative method) of a multinomial probit model similar to, but slightly 
more general than, the model of Hausman and Wise. Their specification is 


$$U_{ij} = x_{ij}'\beta_i + \epsilon_{ij}, \tag{9.3.74}$$


where $\beta_i \sim N(\beta, \Sigma_\beta)$ and $\epsilon_i = (\epsilon_{i0}, \epsilon_{i1}, \ldots, \epsilon_{im}) \sim N(0, \Sigma_\epsilon)$. As in Hausman 
and Wise, $\beta_i$ and $\epsilon_i$ are assumed to be independent of each other and independent 
through $i$. A certain normalization is employed on the parameters to 
make them identifiable. 

An interesting feature of their computer program is that it gives the user the 
option of calculating the probability in the form (9.3.72) at each step of the 
iteration either (1) by simulation or (2) by Clark's approximation (Clark, 
1961), rather than by series expansion. The authors claim that their program 
can handle as many as ten alternatives. The simulation works as follows: 
Consider evaluating (9.3.72), for example. We artificially generate many observations 
on $U_0$, $U_1$, and $U_2$ according to $f$ evaluated at particular parameter 
values and simply estimate the probability by the observed frequency. Clark's 
method is based on a normal approximation of the distribution of max (X, Y ) 
when both X and Y are normally distributed. The exact mean and variance of 
max (X, Y), which can be evaluated easily, are used in the approximation. 
Albright, Lerman, and Manski performed a Monte Carlo study, which 
showed Clark's method to be quite accurate. However, several other authors 
have contradicted this conclusion, saying that Clark's approximation can be 
quite erratic in some cases (see, for example, Horowitz, Sparmann, and Da- 
ganzo, 1982). 
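
Both devices can be sketched in a few lines of Python. The sketch below is restricted, for brevity and as our own simplifying assumption, to mutually independent normal utilities; the general case requires correlated draws and the covariance between $U_2$ and $\max(U_0, U_1)$. The moment formulas are those of Clark (1961) for the maximum of two normal variables, and all function names are hypothetical.

```python
import math
import random

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def clark_max_moments(m1, s1, m2, s2, rho=0.0):
    """Exact mean and variance of max(X, Y) for bivariate normal (X, Y).

    m1, m2 are means, s1, s2 standard deviations, rho the correlation.
    """
    a = math.sqrt(s1 * s1 + s2 * s2 - 2.0 * rho * s1 * s2)
    alpha = (m1 - m2) / a
    nu1 = m1 * norm_cdf(alpha) + m2 * norm_cdf(-alpha) + a * norm_pdf(alpha)
    nu2 = ((m1 * m1 + s1 * s1) * norm_cdf(alpha)
           + (m2 * m2 + s2 * s2) * norm_cdf(-alpha)
           + (m1 + m2) * a * norm_pdf(alpha))
    return nu1, nu2 - nu1 * nu1

def clark_prob_independent(mu, sigma):
    """Clark-type approximation of P(U2 > max(U0, U1)), independent case.

    max(U0, U1) is replaced by a normal variable with the exact moments.
    """
    m, v = clark_max_moments(mu[0], sigma[0], mu[1], sigma[1])
    return norm_cdf((mu[2] - m) / math.sqrt(sigma[2] ** 2 + v))

def simulated_prob_independent(mu, sigma, draws=100_000, seed=0):
    """Frequency simulator of P(U2 > U1, U2 > U0), independent case."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        u = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        if u[2] > u[0] and u[2] > u[1]:
            hits += 1
    return hits / draws

if __name__ == "__main__":
    mu, sigma = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]
    print(clark_prob_independent(mu, sigma))
    print(simulated_prob_independent(mu, sigma))
```

For three independent standard normal utilities the true probability is 1/3 by symmetry, which gives a simple check on both the approximation and the simulation.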

Albright et al. applied their probit model to the same data that Hausman 
and Wise used, and they estimated the parameters of their model by Clark's 
method. Their model is more general than the model of Hausman and Wise, 
and their independent variables contained additional variables such as mode- 
specific dummies and the number of automobiles in the household. They also 
estimated an independent logit model. Their conclusions were as follows: 

1. Their probit and logit estimates did not differ by much. (They compared 
the raw estimates rather than comparing $\partial P/\partial x$ for each independent 
variable.) 

2. They could not obtain accurate estimates of 2, in their probit model. 

3. Based on the Akaike Information Criterion (Section 4.5.2), an increase 
in log L in their probit model as compared to their logit model was not 
large enough to compensate for the loss in degrees of freedom. 

4. The logit model took 4.5 seconds per iteration, whereas the probit model 
took 60-75 CPU seconds per iteration on an IBM 370 Model 158. 
Thus, though they demonstrated the feasibility of their probit model, the 
gain of using the probit model over the independent logit did not seem to 
justify the added cost for this particular data set. 

The discrepancy between the conclusions of Hausman and Wise and those 
of Albright et al. is probably due to the fact that Hausman and Wise imposed 
certain zero specifications on the covariance matrix, an observation that 
suggests that covariance specification plays a crucial role in this type of model. 


9.3.10 Sequential Probit and Logit Models 


When the choice decision is made sequentially, the estimation of multinomial 
models can be reduced to the successive estimation of models with fewer 
responses, and this results in computational economy. We shall illustrate this 
with a three-response sequential probit model. Suppose an individual determines 
whether $y = 2$ or $y \neq 2$ and then, given $y \neq 2$, determines whether 
$y = 1$ or 0. Assuming that each choice is made according to a binary probit 
model, we can specify a sequential probit model by 


$$P_2 = \Phi(x_1'\beta_1) \tag{9.3.75}$$

and 

$$P_1 = [1 - \Phi(x_1'\beta_1)]\Phi(x_2'\beta_2). \tag{9.3.76}$$


The likelihood function of this model can be maximized by maximizing the 
likelihood function of two binary probit models. 
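
The factorization of the likelihood can be checked numerically. In the Python sketch below, the two arguments of each function stand for the values of the two probit indices appearing in (9.3.75) and (9.3.76); the function names and the small data set in the usage example are our own assumptions.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sequential_probit_probs(index1, index2):
    """Return (P0, P1, P2) for the three-response sequential probit model."""
    p2 = norm_cdf(index1)                      # (9.3.75)
    p1 = (1.0 - p2) * norm_cdf(index2)         # (9.3.76)
    p0 = 1.0 - p1 - p2
    return p0, p1, p2

def split_log_likelihood(ys, idx1, idx2):
    """Log likelihood written as the sum of two binary probit parts.

    First part : y = 2 versus y != 2, using all observations.
    Second part: y = 1 versus y = 0, using only observations with y != 2.
    """
    ll_first = 0.0
    ll_second = 0.0
    for y, a, b in zip(ys, idx1, idx2):
        ll_first += math.log(norm_cdf(a) if y == 2 else 1.0 - norm_cdf(a))
        if y != 2:
            ll_second += math.log(norm_cdf(b) if y == 1 else 1.0 - norm_cdf(b))
    return ll_first, ll_second

if __name__ == "__main__":
    ys = [2, 1, 0, 1]
    idx1 = [0.2, -0.5, 0.1, 0.7]
    idx2 = [0.0, 0.3, -0.4, 1.1]
    first, second = split_log_likelihood(ys, idx1, idx2)
    print(first + second)
```

Because the first part involves only the first index and the second part only the second, each can be maximized separately by an ordinary binary probit routine.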

A sequential logit model can be analogously defined. Kahn and Morimune 
(1979) used such a model to explain the number of employment spells a 
worker experienced in 1966 by independent variables such as the number of 
grades completed, a health dummy, a marriage dummy, the number of children, 
a part-time employment dummy, and experience. The dependent variable 
$y_i$ is assumed to take one of the four values (0, 1, 2, and 3) corresponding 
to the number of spells experienced by the $i$th worker, except that $y_i = 3$ 
means "greater than or equal to 3 spells." Kahn and Morimune specified 
probabilities sequentially as 


$$P(y_i = 0) = \Lambda(x_i'\beta_0), \tag{9.3.77}$$

$$P(y_i = 1 \mid y_i \neq 0) = \Lambda(x_i'\beta_1), \tag{9.3.78}$$

and 

$$P(y_i = 2 \mid y_i \neq 0,\ y_i \neq 1) = \Lambda(x_i'\beta_2). \tag{9.3.79}$$
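
The unconditional probabilities of the four outcomes follow by multiplying the conditional probabilities in sequence. A minimal Python sketch, in which `v0`, `v1`, and `v2` denote the values of the three indices in (9.3.77) through (9.3.79) and the function names are our own, is:

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def sequential_logit_probs(v0, v1, v2):
    """Return (P(y=0), P(y=1), P(y=2), P(y=3)) for the sequential logit model."""
    p0 = logistic(v0)                                 # (9.3.77)
    p1 = (1.0 - p0) * logistic(v1)                    # uses (9.3.78)
    p2 = (1.0 - p0) * (1.0 - logistic(v1)) * logistic(v2)          # uses (9.3.79)
    p3 = (1.0 - p0) * (1.0 - logistic(v1)) * (1.0 - logistic(v2))  # remainder
    return p0, p1, p2, p3

if __name__ == "__main__":
    print(sequential_logit_probs(0.0, 0.0, 0.0))  # (0.5, 0.25, 0.125, 0.125)
```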


Note that Kahn and Morimune could have used an ordered logit model 
with their data because we can conjecture that a continuous unobservable 
variable $y_i^*$ (interpreted as a measure of the tendency for unemployment) 
exists that affects the discrete outcome. Specifying $y_i^* = x_i'\beta + \epsilon_i$ would lead to 
one of the ordered models discussed in Section 9.3.2. 


9.4 Multivariate Models 
9.4.1 Introduction 


A multivariate QR model specifies the joint probability distribution of two or 
more discrete dependent variables. For example, suppose there are two binary 
dependent variables $y_1$ and $y_2$, each of which takes values 1 or 0. Their joint 
distribution can be described by Table 9.1, where $P_{jk} = P(y_1 = j, y_2 = k)$. The 
model is completed by specifying $P_{11}$, $P_{10}$, and $P_{01}$ as functions of independent 
variables and unknown parameters. ($P_{00}$ is determined as one minus the sum 
of the other three probabilities.) 

A multivariate QR model is a special case of a multinomial QR model. For 
example, the model represented by Table 9.1 is equivalent to a multinomial 
model for a single discrete dependent variable that takes four values with 
probabilities $P_{11}$, $P_{10}$, $P_{01}$, and $P_{00}$. Therefore the theory of statistical inference 
discussed in regard to a multinomial model in Section 9.3.1 is valid for a 
multivariate model without modification. 

In the subsequent sections we shall present various types of multivariate QR 
models and show in each model how the probabilities are specified to take 
account of the multivariate features of the model. 

Before going into the discussion of various ways to specify multivariate QR 
models, we shall comment on an empirical article that is concerned with a 
multivariate model but in which its specific multivariate features are (perhaps 
justifiably) ignored. By doing this, we hope to shed some light on the distinction 
between a multivariate model and the other multinomial models. 


Table 9.1 Joint distribution of two binary random variables 

              $y_2 = 1$    $y_2 = 0$ 
$y_1 = 1$     $P_{11}$     $P_{10}$ 
$y_1 = 0$     $P_{01}$     $P_{00}$ 


EXAMPLE 9.4.1 (Silberman and Durden, 1976). Silberman and Durden 
analyzed how representatives voted on two bills (House bill and Substitute 
bill) concerning the minimum wage. The independent variables are the socio- 
economic characteristics of a legislator’s congressional district, namely, the 
campaign contribution of labor unions, the campaign contribution of small 
businesses, the percentage of persons earning less than $4000, the percentage 
of persons aged 16-21, and a dummy for the South. Denoting House bill and 
Substitute bill by H and S, the actual counts of votes on the bills are given in 
Table 9.2. 

The zero count in the last cell explains why Silberman and Durden did not 
set up a multivariate QR model to analyze the data. Instead, they used an 
ordered probit model by ordering the three nonzero responses in the order of a 
representative's feeling in favor of the minimum wage as shown in Table 9.3. 
Assuming that $y^*$ is normally distributed with mean linearly dependent on the 
independent variables, Silberman and Durden specified the probabilities as 


$$P_{i0} = \Phi(x_i'\beta) \tag{9.4.1}$$

and 

$$P_{i0} + P_{i1} = \Phi(x_i'\beta + \alpha), \tag{9.4.2}$$

where $\alpha > 0$. 


An alternative specification that takes into consideration the multivariate 
nature of the problem and at the same time recognizes the zero count of the 
last cell may be developed in the form of a sequential probit model (Section 
9.3.10) as follows: 


$$P(H_i = \text{Yes}) = \Phi(x_i'\beta_1) \tag{9.4.3}$$

and 

$$P(S_i = \text{No} \mid H_i = \text{Yes}) = \Phi(x_i'\beta_2). \tag{9.4.4}$$


Table 9.2 Votes on House (H) and Substitute (S) bills 

Vote for              Vote for House bill 
Substitute bill       Yes       No 
Yes                    75      119 
No                    205        0 


Table 9.3 Index of feeling toward the minimum wage 

                      $y^*$ (feeling in favor of 
$y$    S      H       the minimum wage) 
0      Yes    No      Weakest 
1      Yes    Yes     Medium 
2      No     Yes     Strongest 


The choice between these two models must be determined, first, from a 
theoretical standpoint based on an analysis of the legislator’s behavior and, 
second, if the first consideration is not conclusive, from a statistical standpoint 
based on some measure of goodness of fit. Because the two models involve 
different numbers of parameters, adjustments for the degrees of freedom, such 
as the Akaike Information Criterion (Section 4.5.2), must be employed. The 
problem boils down to the question, Is the model defined by (9.4.3) and (9.4.4) 
sufficiently better than the model defined by (9.4.1) and (9.4.2) to compensate 
for a reduction in the degrees of freedom? 


9.4.2 Multivariate Nested Logit Model 


The model to be discussed in this subsection is identical to the nested logit 
model discussed in Section 9.3.5. We shall merely give an example of its use in 
a multivariate situation. We noted earlier that the nested logit model is useful 
whenever a set of alternatives can be classified into classes each of which 
contains similar alternatives. It is useful in a multivariate situation because the 
alternatives can be naturally classified according to the outcome of one of the 
variables. For example, in a 2 × 2 case such as in Table 9.1, the four alternatives 
can be classified according to whether $y_1 = 1$ or $y_1 = 0$. Using a parameterization 
similar to Example 9.3.5, we can specialize the nested logit model to 
a 2 × 2 multivariate model as follows: 


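$$
\begin{aligned}
P(y_1 = 1) ={}& a_1 \exp(z_1'\gamma)[\exp(\rho_1^{-1}x_{11}'\beta) + \exp(\rho_1^{-1}x_{10}'\beta)]^{\rho_1} \\
& \div \{a_1 \exp(z_1'\gamma)[\exp(\rho_1^{-1}x_{11}'\beta) + \exp(\rho_1^{-1}x_{10}'\beta)]^{\rho_1} \\
& \quad + a_0 \exp(z_0'\gamma)[\exp(\rho_0^{-1}x_{01}'\beta) + \exp(\rho_0^{-1}x_{00}'\beta)]^{\rho_0}\},
\end{aligned}
\tag{9.4.5}
$$

$$P(y_2 = 1 \mid y_1 = 1) = \frac{\exp(\rho_1^{-1}x_{11}'\beta)}{\exp(\rho_1^{-1}x_{11}'\beta) + \exp(\rho_1^{-1}x_{10}'\beta)}, \tag{9.4.6}$$

and 

$$P(y_2 = 1 \mid y_1 = 0) = \frac{\exp(\rho_0^{-1}x_{01}'\beta)}{\exp(\rho_0^{-1}x_{01}'\beta) + \exp(\rho_0^{-1}x_{00}'\beta)}. \tag{9.4.7}$$


The two-step estimation method discussed in Section 9.3.5 can be applied to 
this model. 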


9.4.3 Log-Linear Model 


A log-linear model refers to a particular parameterization of a multivariate 
model. We shall discuss it in the context of the 2 X 2 model given in Table 9.1. 
For the moment we shall assume that there are no independent variables and 
that there is no constraint among the probabilities; therefore the model is 
completely characterized by specifying any three of the four probabilities 
appearing in Table 9.1. We shall call Table 9.1 the basic parameterization and 
shall consider two alternative parameterizations. 

The first alternative parameterization is a logit model and is given in Table 
9.4, where d is the normalization chosen to make the sum of probabilities 
equal to unity. The second alternative parameterization is called a log-linear 
model and is given in Table 9.5, where d, again, is a proper normalization, not 
necessarily equal to the d in Table 9.4. 

The three models described in Tables 9.1, 9.4, and 9.5 are equivalent; they 
differ only in parameterization. Parameterizations in Tables 9.4 and 9.5 have 
the attractive feature that the conditional probabilities have a simple logistic 
form. For example, in Table 9.4 we have 


$$P(y_1 = 1 \mid y_2 = 1) = \frac{\exp(\alpha_{11})}{\exp(\alpha_{11}) + \exp(\alpha_{01})} = \Lambda(\alpha_{11} - \alpha_{01}). \tag{9.4.8}$$


Note, however, that the parameterization in Table 9.5 has an additional 
attractive feature: $\alpha_{12} = 0$ if and only if $y_1$ and $y_2$ are independent. The role 
of $\alpha_{12}$ also can be seen by the following equation that can be derived from 
Table 9.5: 


$$P(y_1 = 1 \mid y_2) = \Lambda(\alpha_1 + \alpha_{12} y_2). \tag{9.4.9}$$


[Table 9.4 Bivariate logit model] 


Table 9.5 Log-linear model 

              $y_2 = 1$                                       $y_2 = 0$ 
$y_1 = 1$     $d^{-1} e^{\alpha_1 + \alpha_2 + \alpha_{12}}$     $d^{-1} e^{\alpha_1}$ 
$y_1 = 0$     $d^{-1} e^{\alpha_2}$                              $d^{-1}$ 

The log-linear parameterization in Table 9.5 can also be defined by 

$$P(y_1, y_2) \propto \exp(\alpha_1 y_1 + \alpha_2 y_2 + \alpha_{12} y_1 y_2), \tag{9.4.10}$$


where $\propto$ reads “is proportional to.” This formulation can be generalized to a 
log-linear model of more than two binary random variables as follows (we 
shall write only the case of three variables): 


$$P(y_1, y_2, y_3) \propto \exp(\alpha_1 y_1 + \alpha_2 y_2 + \alpha_3 y_3 + \alpha_{12} y_1 y_2 + \alpha_{13} y_1 y_3 + \alpha_{23} y_2 y_3 + \alpha_{123} y_1 y_2 y_3). \tag{9.4.11}$$


The first three terms in the exponential function are called the main effects. 
Terms involving the product of two variables are called second-order interac- 
tion terms, the product of three variables third-order interaction terms, and so 
on. Note that (9.4.11) involves seven parameters that can be put into a one-to- 
one correspondence with the seven probabilities that completely determine 
the distribution of $y_1$, $y_2$, and $y_3$. Such a model, without any constraint among 
the parameters, is called a saturated model. A saturated model for $J$ binary 
variables involves $2^J - 1$ parameters. Researchers often use a constrained 
log-linear model, called an unsaturated model, which is obtained by setting 
some of the higher-order interaction terms to 0. 
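
The correspondence between the log-linear parameters and the joint probabilities can be made concrete with a short Python sketch. The representation of the parameters as a dictionary keyed by tuples of variable indices is our own convention, as is the function name; the normalization $d$ is computed as the sum of the unnormalized weights.

```python
import itertools
import math

def log_linear_joint(alpha, J):
    """Joint distribution of J binary variables under a log-linear model.

    alpha : dict mapping a tuple of variable indices, e.g. (1,) or (1, 2),
            to the coefficient of the product of those variables.
    Returns a dict mapping each outcome (y_1, ..., y_J) to its probability.
    """
    weights = {}
    for y in itertools.product((0, 1), repeat=J):
        # A term contributes only when every variable in its index set is 1.
        exponent = sum(c for idx, c in alpha.items()
                       if all(y[i - 1] == 1 for i in idx))
        weights[y] = math.exp(exponent)
    d = sum(weights.values())          # normalization constant
    return {y: w / d for y, w in weights.items()}

if __name__ == "__main__":
    # Two variables with no interaction: y_1 and y_2 are independent.
    p = log_linear_joint({(1,): 0.4, (2,): -0.3, (1, 2): 0.0}, 2)
    print(p[(1, 1)], (p[(1, 1)] + p[(1, 0)]) * (p[(1, 1)] + p[(0, 1)]))
    # A saturated model for J = 3 has 2**3 - 1 = 7 free parameters.
    print(2 ** 3 - 1)
```

Setting an interaction coefficient to 0 in the dictionary gives an unsaturated model without changing the code.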
Example 9.4.2 is an illustration of a multivariate log-linear model. 


EXAMPLE 9.4.2 (Goodman, 1972). Goodman sought to explain whether a 
soldier prefers a Northern camp to a Southern camp ($y_0$) by the race of the 
soldier ($y_1$), the region of his origin ($y_2$), and the present location of his camp 
(North or South) ($y_3$). Because each conditional probability has a logistic 
form, a log-linear model is especially suitable for analyzing a model of this 
sort. Generalizing (9.4.9), we can write a conditional probability as 

$$
\begin{aligned}
P(y_0 = 1 \mid y_1, y_2, y_3) = \Lambda(&\alpha_0 + \alpha_{01} y_1 + \alpha_{02} y_2 + \alpha_{03} y_3 + \alpha_{012} y_1 y_2 \\
&+ \alpha_{013} y_1 y_3 + \alpha_{023} y_2 y_3 + \alpha_{0123} y_1 y_2 y_3).
\end{aligned}
\tag{9.4.12}
$$


Goodman looked at the asymptotic $t$ value of the MLE of each $\alpha$ (the MLE 
divided by its asymptotic standard deviation) and tentatively concluded 
$\alpha_{012} = \alpha_{013} = \alpha_{0123} = 0$, called the null hypothesis. Then he proceeded to 
accept formally the null hypothesis as a result of the following testing procedure. 
Define $\hat{P}_t$, $t = 1, 2, \ldots, 16$, as the observed frequencies in the 16 cells 
created by all the possible joint outcomes of the four binary variables. (They 
can be interpreted as the unconstrained MLE's of the probabilities $P_t$.) Define 
$\tilde{P}_t$ as the constrained MLE of $P_t$ under the null hypothesis. Then we must 
reject the null hypothesis if and only if 

$$n \sum_{t=1}^{16} \frac{(\hat{P}_t - \tilde{P}_t)^2}{\tilde{P}_t} > \chi^2_{3,\alpha}, \tag{9.4.13}$$

where $n$ is the total number of soldiers in the sample and $\chi^2_{3,\alpha}$ is the $\alpha$% critical 
value of $\chi^2_3$. [Note that the left-hand side of (9.4.13) is analogous to (9.3.24).] 
Or, alternatively, we can use 


$$2n \sum_{t=1}^{16} \hat{P}_t \log \frac{\hat{P}_t}{\tilde{P}_t} > \chi^2_{3,\alpha}. \tag{9.4.14}$$
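
Both test statistics are simple functions of the unconstrained and constrained cell probabilities. The Python sketch below computes the left-hand sides of (9.4.13) and (9.4.14); the function name and the two-cell numbers in the usage example are hypothetical and serve only to show the arithmetic, not Goodman's data.

```python
import math

def pearson_and_lr(n, p_hat, p_tilde):
    """Return the two statistics compared with the chi-square critical value.

    n       : sample size
    p_hat   : observed cell frequencies (unconstrained MLE)
    p_tilde : constrained MLE of the cell probabilities (all positive)
    """
    pearson = n * sum((ph - pt) ** 2 / pt for ph, pt in zip(p_hat, p_tilde))
    # Cells with zero observed frequency contribute 0 to the second statistic.
    lr = 2.0 * n * sum(ph * math.log(ph / pt)
                       for ph, pt in zip(p_hat, p_tilde) if ph > 0.0)
    return pearson, lr

if __name__ == "__main__":
    pearson, lr = pearson_and_lr(100, [0.5, 0.5], [0.4, 0.6])
    print(pearson, lr)
```

The two statistics are asymptotically equivalent under the null hypothesis, so they typically take similar values, as in the usage example.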


We shall indicate how to generalize a log-linear model to the case of discrete 
variables that take more than two values. This is done simply by using the 
binary variables defined in (9.3.2). We shall illustrate this idea by a simple 
example: Suppose there are two variables $z$ and $y_3$ such that $z$ takes the three 
values 0, 1, and 2 and $y_3$ takes the two values 0 and 1. Define two binary (0, 1) 
variables $y_1$ and $y_2$ by the rule: $y_1 = 1$ if $z = 1$ and $y_2 = 1$ if $z = 2$. Then we can 
specify $P(z, y_3)$ by specifying $P(y_1, y_2, y_3)$, which we can specify by a log-linear 
model as in (9.4.11). However, we should remember one small detail: Because 
in the present case $y_1 y_2 = 0$ by definition, the two terms involving $y_1 y_2$ in the 
right-hand side of (9.4.11) drop out. 

In the preceding discussion we have touched upon only a small aspect of the 
log-linear model. There is a vast amount of work on this topic in the statistical 
literature. The interested reader should consult articles by Haberman (1978, 
1979), Bishop, Fienberg, and Holland (1975), or the many references to Leo 
Goodman's articles cited therein. 

Nerlove and Press (1973) proposed making the parameters of a log-linear 
model dependent on independent variables. Specifically, they proposed the 
main-effect parameters ($\alpha_1$, $\alpha_2$, and $\alpha_3$ in (9.4.11)) to be linear combinations 
of independent variables. (However, there is no logical necessity to 
restrict this formulation to the main effects.) 

Because in a log-linear model each conditional probability has a logit form 
as in (9.4.12), the following estimation procedure (which is simpler than 
MLE) can be used: Maximize the product of the conditional probabilities with 
respect to the parameters that appear therein. The remaining parameters must 
be estimated by maximizing the remaining part of the likelihood function. 
Amemiya (1975b) has given a sketch of a proof of the consistency of this 
estimator. We would expect that the estimator is, in general, not asymptoti- 
cally as efficient as the MLE. However, Monte Carlo evidence, as reported by 
Guilkey and Schmidt (1979), suggests that the loss of efficiency may be minor. 


9.4.4 Multivariate Probit Model 


A multivariate probit model was first proposed by Ashford and Sowden 
(1970) and applied to a bivariate set of data. They supposed that a coal miner 
develops breathlessness ($y_1 = 1$) if his tolerance ($y_1^*$) is less than 0. Assuming 
that $y_1^* \sim N(-\beta_1'x, 1)$, where $x = (1, \text{Age})'$, we have 


$$P(y_1 = 1) = \Phi(\beta_1'x). \tag{9.4.15}$$


They also supposed that a coal miner develops wheeze ($y_2 = 1$) if his tolerance 
($y_2^*$) against wheeze is less than 0 and that $y_2^* \sim N(-\beta_2'x, 1)$. Then we have 


$$P(y_2 = 1) = \Phi(\beta_2'x). \tag{9.4.16}$$


Now that we have specified the marginal probabilities of $y_1$ and $y_2$, the multivariate 
model is completed by specifying the joint probability 
$P(y_1 = 1, y_2 = 1)$, which in turn is determined if the joint distribution of $y_1^*$ 
and $y_2^*$ is specified. Ashford and Sowden assumed that $y_1^*$ and $y_2^*$ are jointly 
normal with a correlation coefficient $\rho$. Thus 


$$P(y_1 = 1, y_2 = 1) = F_\rho(\beta_1'x, \beta_2'x), \tag{9.4.17}$$


where $F_\rho$ denotes the bivariate normal distribution function with zero means, 
unit variances, and correlation $\rho$. 

The parameters $, , $2, and p can be estimated by MLE or MIN 7? (if there 
are many observations per cell for the latter method). Amemiya (1972) has 
given the MIN 7? estimation of this model. 
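
As a rough illustration of (9.4.15) through (9.4.17), the following sketch evaluates the four cell probabilities of a bivariate probit model for one value of x. It is not taken from Ashford and Sowden: the function names, the one-dimensional Simpson integration used for $F_\rho$, and the truncation point of the integral are all assumptions made for this example, and the coefficient values in the usage note are invented.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bvn_cdf(a, b, rho, steps=4000, lower=-10.0):
    """F_rho(a, b) for zero means, unit variances, and correlation rho.

    Uses F_rho(a, b) = int_{-inf}^{a} phi(u) Phi((b - rho*u) / sqrt(1 - rho^2)) du,
    truncated at `lower` (an assumption) and evaluated by Simpson's rule.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    if a <= lower:
        return 0.0
    n = steps + (steps % 2)  # Simpson's rule needs an even number of intervals
    width = (a - lower) / n
    scale = math.sqrt(1.0 - rho * rho)

    def integrand(u):
        return norm_pdf(u) * norm_cdf((b - rho * u) / scale)

    total = integrand(lower) + integrand(a)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * integrand(lower + k * width)
    return total * width / 3.0


def bivariate_probit_cells(beta1, beta2, rho, x):
    """Cell probabilities P(y1, y2) implied by (9.4.15)-(9.4.17)."""
    a = sum(c * v for c, v in zip(beta1, x))  # beta_1' x
    b = sum(c * v for c, v in zip(beta2, x))  # beta_2' x
    p1, p2 = norm_cdf(a), norm_cdf(b)         # marginal probabilities
    p11 = bvn_cdf(a, b, rho)                  # joint probability (9.4.17)
    return {(1, 1): p11, (1, 0): p1 - p11,
            (0, 1): p2 - p11, (0, 0): 1.0 - p1 - p2 + p11}
```

For example, `bivariate_probit_cells([-1.0, 0.02], [-0.5, 0.01], 0.4, [1.0, 45.0])` returns the four probabilities for a hypothetical 45-year-old miner; a log likelihood is then the sum of the logarithms of the observed cells.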




Muthén (1979) estimated the following sociological model, which is equiva- 
lent to a bivariate probit model: 


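$$
\begin{aligned}
y_1 &= 1 \quad \text{if } u_1 < \alpha_1 + \beta_1 \eta \\
y_2 &= 1 \quad \text{if } u_2 < \alpha_2 + \beta_2 \eta \\
\eta &= x'\gamma + v \\
u_1, u_2 &\sim N(0, 1), \quad v \sim N(0, \sigma^2).
\end{aligned}
\tag{9.4.18}
$$

Here $y_1$ and $y_2$ represent responses of parents to questions about their attitudes 
toward their children, $\eta$ is a latent (unobserved) variable signifying parents’ 
sociological philosophy, and x is a vector of observable characteristics of 
parents. Muthén generalized this model to a multivariate model in which y, x, 
$\eta$, u, v, and $\alpha$ are all vectors following 

$$
\begin{aligned}
y &= \mathbf{1} \quad \text{if } u < \alpha + \Lambda\eta \\
B\eta &= \Gamma x + v \\
u &\sim N(0, I), \quad v \sim N(0, \Sigma),
\end{aligned}
\tag{9.4.19}
$$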


where 1 is a vector of ones, and discussed the problem of identification. In 
addition, Lee (1982a) has applied a model like (9.4.19) to a study of the 
relationship between health and wages. 

It is instructive to note a fundamental difference between the multivariate 
probit model and the multivariate logit or log-linear models discussed earlier: 
in the multivariate probit model the marginal probabilities are first specified 
and then a joint probability consistent with the given marginal probabilities is 
found, whereas in the multivariate logit and log-linear models the joint proba- 
bilities or conditional probabilities are specified at the outset. The conse- 
quence of these different methods of specification is that marginal probabili- 
ties have a simple form (probit) in the multivariate probit model and the 
conditional probabilities have a simple form (logit) in the multivariate logit 
and log-linear models. 

Because of this fundamental difference between a multivariate probit 
model and a multivariate logit model, it is an important practical problem for 
a researcher to compare the two types of models using some criterion of 
goodness of fit. Morimune (1979) compared the Ashford-Sowden bivariate 
probit model with the Nerlove-Press log-linear model empirically in a model 
in which the two binary dependent variables represent home ownership ($y_1$) 
and whether or not the house has more than five rooms ($y_2$). As criteria for 
comparison, Morimune used Cox’s test (Section 4.5.3) and his own modification 
of it. He concluded that probit was preferred to logit by either test. 

It is interesting to ask whether we could specify an Ashford-Sowden type 
bivariate logit model by assuming the logistic distribution for $y_1^*$ and $y_2^*$ in the 
Ashford-Sowden model. Although there is no "natural" bivariate logistic 
distribution the marginal distributions of which are logistic (unlike the normal 
case), Lee (1982b) found that Plackett’s bivariate logistic distribution function 
(Plackett, 1965) yielded results similar to a bivariate probit model when 
applied to Ashford and Sowden’s data and Morimune’s data. Furthermore, it 
was computationally simpler. 

Given two marginal distribution functions F(x) and G(y), Plackett's class of 
bivariate distributions H(x, y) is defined by 

$$\psi = \frac{H(1 - F - G + H)}{(F - H)(G - H)}, \tag{9.4.20}$$

for any fixed $\psi$ in $(0, \infty)$. 

Unfortunately, this method does not easily generalize to a higher-order 
multivariate distribution, where because of the computational burden of the 
probit model the logit analog of a multivariate probit model would be 
especially useful. Some progress in this direction has been made by Malik and 
Abraham (1973), who generalized Gumbel's bivariate logistic distribution 
(Gumbel, 1961) to a multivariate case. 
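
Plackett's definition (9.4.20) is implicit in H, but it can be solved explicitly because it is a quadratic equation in H. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the choice of the smaller root of the quadratic is the standard one that keeps H between max(0, F + G - 1) and min(F, G), and the pairing with logistic marginals mirrors the bivariate logit idea in the text rather than reproducing Lee's computations.

```python
import math


def logistic_cdf(z):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-z))


def plackett_cdf(F, G, psi):
    """Solve (9.4.20) for H given marginal values F, G and association psi > 0.

    Rearranging (9.4.20) gives (psi - 1) H^2 - [1 + (psi - 1)(F + G)] H + psi F G = 0.
    """
    if psi <= 0.0:
        raise ValueError("psi must be positive")
    if abs(psi - 1.0) < 1e-12:
        return F * G  # psi = 1 corresponds to independence
    s = 1.0 + (psi - 1.0) * (F + G)
    disc = s * s - 4.0 * psi * (psi - 1.0) * F * G
    return (s - math.sqrt(disc)) / (2.0 * (psi - 1.0))


def cross_ratio(F, G, H):
    """Right-hand side of (9.4.20), useful for checking a solution."""
    return H * (1.0 - F - G + H) / ((F - H) * (G - H))


def bivariate_logit_joint(index1, index2, psi):
    """P(y1 = 1, y2 = 1) when both marginals are logistic (an illustrative pairing)."""
    return plackett_cdf(logistic_cdf(index1), logistic_cdf(index2), psi)
```

Because only a square root is needed, the joint probability is cheap to evaluate compared with the bivariate normal integral, which is consistent with the computational advantage mentioned above.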


9.5 Choice-Based Sampling 
9.5.1 Introduction 


Consider the multinomial QR model (9.3.1) or its special case (9.3.4). Up 
until now we have specified only the conditional probabilities of alternatives 
$j = 0, 1, \ldots, m$ given a vector of exogenous or independent variables x and 
have based our statistical inference on the conditional probabilities. Thus we 
have been justified in treating x as a vector of known constants just as in the 
classical linear regression model of Chapter 1. We shall now treat both j and x 
as random variables and consider different sampling schemes that specify how 
j and x are sampled. 
First, we shall list a few basic symbols frequently used in the subsequent 
discussion: 

$P(j|x, \beta)$ or $P(j)$: Conditional probability the jth alternative is chosen, 
    given the exogenous variables x 
$P(j|x, \beta_0)$ or $P_0(j)$: The above evaluated at the true value of $\beta$ 
$f(x)$: True density of x (note 10) 
$g(x)$: Density according to which a researcher draws x 
$H(j)$: Probability according to which a researcher draws j 
$Q(j) = Q(j|\beta) = \int P(j|x, \beta) f(x)\,dx$ 
$Q_0(j) = Q(j|\beta_0) = \int P(j|x, \beta_0) f(x)\,dx$ 


Let $i = 1, 2, \ldots, n$ be the individuals sampled according to some scheme. 
Then we can denote the alternative and the vector of the exogenous variables 
observed for the ith individual by $j_i$ and $x_i$, respectively. 

We consider two types of sampling schemes called exogenous sampling and 
endogenous sampling (or choice-based sampling in the QR model). The first 
refers to sampling on the basis of exogenous variables, and the second refers to 
sampling on the basis of endogenous variables. The different sampling 
schemes are characterized by their likelihood functions. The likelihood function 
associated with exogenous sampling is given by (note 11) 

$$L_e = \prod_{i=1}^{n} P(j_i|x_i, \beta)\, g(x_i). \tag{9.5.1}$$

The likelihood function associated with choice-based sampling is given by 

$$L_c = \prod_{i=1}^{n} P(j_i|x_i, \beta) f(x_i)\, Q(j_i|\beta)^{-1} H(j_i) \tag{9.5.2}$$

if $Q(j|\beta_0)$ is unknown and by 

$$L_{c0} = \prod_{i=1}^{n} P(j_i|x_i, \beta) f(x_i)\, Q(j_i|\beta_0)^{-1} H(j_i) \tag{9.5.3}$$

if $Q(j|\beta_0)$ is known. Note that if $g(x) = f(x)$ and $H(j) = Q(j|\beta_0)$, (9.5.1) and 
(9.5.3) both become 

$$L^* = \prod_{i=1}^{n} P(j_i|x_i, \beta) f(x_i), \tag{9.5.4}$$


which is the standard likelihood function associated with random sampling. 
This is precisely the likelihood function considered in the previous sections. 

Although (9.5.2) may seem unfamiliar, it can be explained easily as follows. 
Consider drawing random variables j and x in the following order. We can first 
draw j with probability $H(j)$ and then, given j, we can draw x according to the 
conditional density $f(x|j)$. Thus the joint probability is $f(x|j)H(j)$, which by 
Bayes's rule is equal to $P(j|x)f(x)Q(j)^{-1}H(j)$. 

This sampling scheme is different from a scheme in which the proportion of 
people choosing alternative j is a priori determined and fixed. This latter 
scheme may be a more realistic one. (Hsieh, Manski, and McFadden, 1983, 
have discussed this sampling scheme.) However, we shall adopt the definition 
of the preceding paragraphs (following Manski and Lerman, 1977) because in 
this way choice-based sampling contains random sampling as a special case 
[$Q(j) = H(j)$] and because the two definitions lead to the same estimators 
with the same asymptotic properties. 

Choice-based sampling is useful in a situation where exogenous sampling or 
random sampling would find only a very small number of people choosing a 
particular alternative. For example, suppose a small proportion of residents in 
a particular city use the bus for going to work. Then, to ensure that a certain 
number of bus riders are included in the sample, it would be much less 
expensive to interview people at a bus depot than to conduct a random survey 
of homes. Thus it is expected that random sampling augmented with choice- 
based sampling of rare alternatives would maximize the efficiency of estima- 
tion within the budgetary constraints of a researcher. Such augmentation can 
be analyzed in the framework of generalized choice-based sampling proposed 
by Cosslett (1981a) (to be discussed in Section 9.5.4). 

In the subsequent subsections we shall discuss four articles: Manski and 
Lerman (1977), Cosslett (1981a), Cosslett (1981b), and Manski and McFadden 
(1981). These articles together cover the four different types of models, 
varying according to whether f is known and whether Q is known, and cover 
five estimators of $\beta$: the exogenous sampling maximum likelihood estimator 
(ESMLE), the random sampling maximum likelihood estimator (RSMLE), 
the choice-based sampling maximum likelihood estimator (CBMLE), the 
Manski-Lerman weighted maximum likelihood estimator (WMLE), and the 
Manski-McFadden estimator (MME). 

A comparison of RSMLE and CBMLE is important because within the 
framework of choice-based sampling a researcher controls $H(j)$, and the 
particular choice $H(j) = Q_0(j)$ yields random sampling. The choice of $H(j)$ is an 
important problem of sample design and, as we shall see later, $H(j) = Q_0(j)$ is 
not necessarily an optimal choice. 

Table 9.6 indicates how the definitions of RSMLE and CBMLE vary with 
the four types of model; it also indicates in which article each case is discussed. 
Note that RSMLE = CBMLE if Q is known. ESMLE, which is not listed in 
Table 9.6, is the same as RSMLE except when f is unknown and Q is known. 
In that case ESMLE maximizes $L_e$ with respect to $\beta$ without constraints. 
RSMLE and CBMLE for the case of known Q will be referred to as the 
constrained RSMLE and the constrained CBMLE, respectively. For the case 
of unknown Q, we shall attach unconstrained to each estimator. 

Table 9.6 Models, estimators, and cross references 

f known, Q known (discussed in MM) 
    RSMLE: Max. $L^*$ wrt $\beta$ subject to $Q_0 = \int Pf\,dx$. 
    CBMLE: Max. $L_{c0}$ wrt $\beta$ subject to $Q_0 = \int Pf\,dx$. 
    WMLE:  MAL 
    MME:   - 

f known, Q unknown (discussed in MM) 
    RSMLE: Max. $L^*$ wrt $\beta$. 
    CBMLE: Max. $L_c$ wrt $\beta$. 
    WMLE:  - 
    MME:   - 

f unknown, Q known (discussed in C2; see also Cosslett, 1978) 
    RSMLE: Max. $L^*$ wrt $\beta$ and f subject to $Q_0 = \int Pf\,dx$. 
    CBMLE: Max. $L_{c0}$ wrt $\beta$ and f subject to $Q_0 = \int Pf\,dx$. 
    WMLE:  MAL 
    MME:   MM 

f unknown, Q unknown (discussed in C1, which also proves asymptotic efficiency) 
    RSMLE: Max. $L^*$ wrt $\beta$. 
    CBMLE: Max. $L_c$ wrt $\beta$ and f. 
    WMLE:  - 
    MME:   - 

Note: RSMLE = random sampling maximum likelihood estimator; CBMLE = choice-based 
sampling maximum likelihood estimator; WMLE = Manski-Lerman weighted maximum 
likelihood estimator; MME = Manski-McFadden estimator. 

MM = Manski and McFadden (1981); MAL = Manski and Lerman (1977); C2 = Cosslett 
(1981b); C1 = Cosslett (1981a). 


9.5.2 Results of Manski and Lerman 


Manski and Lerman (1977) considered the choice-based sampling scheme 
represented by the likelihood function (9.5.3), the case where Q is known, 
and proposed the estimator (WMLE), denoted $\hat\beta_w$, which maximizes 

$$S_n = \sum_{i=1}^{n} w(j_i) \log P(j_i|x_i, \beta), \tag{9.5.5}$$

where $w(j) = Q_0(j)/H(j)$. More precisely, $\hat\beta_w$ is a solution of the normal 
equation 

$$\frac{\partial S_n}{\partial \beta} = 0. \tag{9.5.6}$$

It will become apparent that the weights $w(j)$ ensure the consistency of the 
estimator. If weights were not used, (9.5.5) would be reduced to the exogenous 
sampling likelihood function (9.5.1), and the resulting estimator to the usual 
MLE (the ESMLE), which can be shown to be inconsistent unless 
$H(j) = Q_0(j)$. 

It should be noted that because WMLE does not depend on $f(x)$, it can be 
used regardless of whether or not f is known. 

We shall prove the consistency of the WMLE $\hat\beta_w$ in a somewhat different 
way from the authors’ proof (note 12). The basic theorems we shall use are Theorems 
4.1.2 and 4.2.1. We need to make six assumptions. 


ASSUMPTION 9.5.1. The parameter space B is an open subset of Euclidean 
space. 


ASSUMPTION 9.5.2. $H(j) > 0$ for every $j = 0, 1, \ldots, m$. 


ASSUMPTION 9.5.3. $\partial \log P(j|x, \beta)/\partial\beta$ exists and is continuous in an open 
neighborhood $N_1(\beta_0)$ of $\beta_0$ for every j and x. (Note that this assumption 
requires $P(j|x, \beta) > 0$ in the neighborhood.) (note 13) 


ASSUMPTION 9.5.4. $P(j|x, \beta)$ is a measurable function of j and x for every 
$\beta \in B$. 


ASSUMPTION 9.5.5. $\{j, x\}$ are i.i.d. random variables. 


ASSUMPTION 9.5.6. If $\beta \neq \beta_0$, $P[P(j|x, \beta) \neq P(j|x, \beta_0)] > 0$. 


To prove the consistency of $\hat\beta_w$, we first note that Assumptions 9.5.1, 9.5.2, 
and 9.5.3 imply conditions A and B of Theorem 4.1.2 for $S_n$ defined in (9.5.5). 
Next, we check the uniform convergence part of condition C by putting 
$g(y, \theta) = \log P(j|x, \beta) - E \log P(j|x, \beta)$ in Theorem 4.2.1. (Throughout 
Section 9.5, E always denotes the expectation taken under the assumption 
of choice-based sampling; that is, for any $g(j, x)$, 
$E g(j, x) = \int \sum_{j=0}^{m} g(j, x) P(j|x, \beta_0) Q_0(j)^{-1} H(j) f(x)\,dx$.) Because 
$0 < P(j|x, \beta) < 1$ for $\beta \in \Psi$, where $\Psi$ is some compact subset of $N_1(\beta_0)$, by 
Assumption 9.5.3 we clearly have 
$E \sup_{\beta \in \Psi} |\log P(j) - E \log P(j)| < \infty$. This fact, together with 
Assumptions 9.5.4 and 9.5.5, implies that all the conditions of Theorem 4.2.1 
are fulfilled. Thus, to verify the remaining part of condition C, it only remains 
to show that $\lim_{n\to\infty} n^{-1} E S_n$ attains a strict local maximum at $\beta_0$. 

Using Assumption 9.5.5, we have 

$$
\begin{aligned}
n^{-1} E S_n &= \sum_{j=0}^{m} \int w(j) [\log P(j)] P_0(j) f(x) Q_0(j)^{-1} H(j)\,dx \\
&= \sum_{j=0}^{m} \int [\log P(j)] P_0(j) f(x)\,dx \\
&= E^* \log P(j),
\end{aligned}
\tag{9.5.7}
$$

where E in the left-hand side denotes the expectation taken according 
to the true choice-based sampling scheme, whereas $E^*$ after the last equality 
denotes the expectation taken according to the hypothetical random 
sampling scheme. (That is, for any $g(j, x)$, 
$E^* g(j, x) = \int \sum_{j=0}^{m} g(j, x) P(j|x, \beta_0) f(x)\,dx$.) But, by Jensen’s inequality 
(4.2.6) and Assumption 9.5.6, we have 

$$E^* \log P(j|x, \beta) < E^* \log P(j|x, \beta_0) \quad \text{for } \beta \neq \beta_0. \tag{9.5.8}$$


Thus the consistency of WMLE has been proved. 
That the ESMLE is inconsistent under the present model can be shown as 
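follows. Replacing $w(j)$ by 1 in the first equality of (9.5.7), we obtain 

$$n^{-1} E S_n = \sum_{j=0}^{m} \int c_j [\log P(j)] P_0(j) f(x)\,dx, \tag{9.5.9}$$

where $c_j = Q_0(j)^{-1} H(j)$. Evaluating the derivative of (9.5.9) with respect to $\beta$ 
at $\beta_0$ yields 

$$\left.\frac{\partial n^{-1} E S_n}{\partial \beta}\right|_{\beta_0} = \left.\frac{\partial}{\partial \beta} \left\{ \int \left[ \sum_{j=0}^{m} c_j P(j|x, \beta) \right] f(x)\,dx \right\}\right|_{\beta_0}. \tag{9.5.10}$$

It is clear that we must generally have $c_j = 1$ for every j in order for (9.5.10) to 
be 0 (note 14). 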
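
The contrast between the weighted and unweighted criteria can be checked numerically in a toy model. The sketch below is an illustration only, under assumptions that are not in the text at this point: a binary logit $P_0(1) = \Lambda(\beta_0 x)$ with a scalar $\beta_0$, a binary regressor with $P(x = 1) = p$, and hypothetical function names. It computes the expected score under choice-based sampling and finds its root by bisection, so the two roots are the probability limits of WMLE and ESMLE in this toy model.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def expected_score(beta, beta0, p, h, weighted):
    """Expected score per observation under choice-based sampling with H(1) = h."""
    lam0, lam = logistic(beta0), logistic(beta)
    q1 = p * lam0 + (1.0 - p) * 0.5          # Q_0(1) = E_x Lambda(beta0 * x)
    q = {1: q1, 0: 1.0 - q1}
    big_h = {1: h, 0: 1.0 - h}
    # (alternative j, P_0(j | x = 1), d log P(j | x = 1, beta) / d beta)
    cases = ((1, lam0, 1.0 - lam), (0, 1.0 - lam0, -lam))
    total = 0.0
    for j, p0, score in cases:
        weight = q[j] / big_h[j] if weighted else 1.0   # w(j) = Q_0(j) / H(j)
        # Observations with x = 0 have a zero score, so only x = 1 contributes.
        total += p * p0 * (big_h[j] / q[j]) * weight * score
    return total


def solve_root(fn, lo=-20.0, hi=20.0, iterations=200):
    """Bisection for a decreasing function with fn(lo) > 0 > fn(hi)."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if fn(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def probability_limits(beta0, p, h):
    """Roots of the weighted (WMLE) and unweighted (ESMLE) expected scores."""
    wmle = solve_root(lambda b: expected_score(b, beta0, p, h, True))
    esmle = solve_root(lambda b: expected_score(b, beta0, p, h, False))
    return wmle, esmle
```

In this toy model the unweighted root works out to $\beta_0 + \log(c_1/c_0)$, with $c_j$ as defined below (9.5.9), so it coincides with $\beta_0$ only when $c_1 = c_0 = 1$, that is, when $H(j) = Q_0(j)$.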

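The asymptotic normality of $\hat\beta_w$ can be proved with suitable additional 
assumptions by using Theorem 4.1.3. We shall present merely an outline of 
the derivation and shall obtain the asymptotic covariance matrix. The 
necessary rigor can easily be supplied by the reader. 

Differentiating (9.5.5) with respect to $\beta$ yields 

$$\frac{\partial S_n}{\partial \beta} = \sum_{i=1}^{n} w(j_i) \frac{1}{P(j_i)} \frac{\partial P(j_i)}{\partial \beta}. \tag{9.5.11}$$

Because (9.5.11) is a sum of i.i.d. random variables by Assumption 9.5.5, we 
can expect $n^{-1/2} (\partial S_n/\partial\beta)_{\beta_0}$ to converge to a normal random variable under 
suitable assumptions. We can show 

$$n^{-1/2} \left.\frac{\partial S_n}{\partial \beta}\right|_{\beta_0} \to N(0, \Lambda), \tag{9.5.12}$$

where 

$$\Lambda = E[w(j)^2 \gamma\gamma'], \tag{9.5.13}$$

where $\gamma = [\partial \log P(j)/\partial\beta]_{\beta_0}$. Differentiating (9.5.11) with respect to $\beta'$ yields 

$$\frac{\partial^2 S_n}{\partial\beta\,\partial\beta'} = -\sum_{i=1}^{n} w(j_i) \frac{1}{P(j_i)^2} \frac{\partial P(j_i)}{\partial\beta} \frac{\partial P(j_i)}{\partial\beta'} + \sum_{i=1}^{n} w(j_i) \frac{1}{P(j_i)} \frac{\partial^2 P(j_i)}{\partial\beta\,\partial\beta'}. \tag{9.5.14}$$

Using the fact that (9.5.14) is a sum of i.i.d. random variables, we can show 

$$\operatorname{plim}_{n\to\infty} n^{-1} \left.\frac{\partial^2 S_n}{\partial\beta\,\partial\beta'}\right|_{\beta_0} = -E w(j)\gamma\gamma' \equiv A, \tag{9.5.15}$$

because 

$$
\begin{aligned}
E w(j) \frac{1}{P_0(j)} \left.\frac{\partial^2 P(j)}{\partial\beta\,\partial\beta'}\right|_{\beta_0} &= \int \sum_{j=0}^{m} \left.\frac{\partial^2 P(j)}{\partial\beta\,\partial\beta'}\right|_{\beta_0} f(x)\,dx \\
&= \left.\frac{\partial^2}{\partial\beta\,\partial\beta'} \left[ \int \sum_{j=0}^{m} P(j) f(x)\,dx \right]\right|_{\beta_0} = 0.
\end{aligned}
\tag{9.5.16}
$$

Therefore, from (9.5.12) and (9.5.15), we conclude 

$$\sqrt{n}(\hat\beta_w - \beta_0) \to N(0, A^{-1} \Lambda A^{-1}). \tag{9.5.17}$$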


As we noted earlier, a researcher controls H(j) and therefore faces an 
interesting problem of finding an optimal design: What is the optimal choice 
of $H(j)$? We shall consider this problem here in the context of WMLE. Thus 
the question is, What choice of $H(j)$ will minimize $A^{-1} \Lambda A^{-1}$? 

First of all, we should point out that $H(j) = Q_0(j)$ is not necessarily the 
optimal choice. If it were, it would mean that random sampling is always 
preferred to choice-based sampling. The asymptotic covariance matrix of 
$\sqrt{n}(\hat\beta_w - \beta_0)$ when $H(j) = Q_0(j)$ is $(E^*\gamma\gamma')^{-1}$, where $E^*$ denotes the 
expectation taken with respect to the probability distribution $P(j|x, \beta_0) f(x)$. 
Writing $w(j)$ simply as $w$, the question is whether 

$$(Ew\gamma\gamma')^{-1} Ew^2\gamma\gamma' (Ew\gamma\gamma')^{-1} \geq (E^*\gamma\gamma')^{-1}. \tag{9.5.18}$$

The answer is no, even though we do have 

$$(Ew\gamma\gamma')^{-1} Ew^2\gamma\gamma' (Ew\gamma\gamma')^{-1} \geq (E\gamma\gamma')^{-1}, \tag{9.5.19}$$

which follows from the generalized Cauchy-Schwarz inequality 
$Eyy' \geq Eyx'(Exx')^{-1}Exy'$. 

Let us consider an optimal choice of $H(j)$ by a numerical analysis using a 
simple logit model with a single scalar parameter. A more elaborate numerical 
analysis by Cosslett (1981b), which addressed this as well as some other 
questions, will be discussed in Section 9.5.5. 

For this purpose it is useful to rewrite $\Lambda$ and $A$ as 

$$\Lambda = E^* w(j)\gamma\gamma' \tag{9.5.20}$$

and 

$$A = -E^*\gamma\gamma'. \tag{9.5.21}$$

Denoting the asymptotic variance of $\sqrt{n}(\hat\beta_w - \beta_0)$ in the scalar case by $V_w(H)$, 
we define the relative efficiency of $\hat\beta_w$ based on the design $H(j)$ by 

$$\mathrm{Eff}(H) = \frac{V_w(Q)}{V_w(H)} = \frac{E^*(\gamma^2)}{E^*[w(j)\gamma^2]}, \tag{9.5.22}$$

because $\gamma$ is a scalar here. 

We shall evaluate (9.5.22) and determine the optimal $H(j)$ that maximizes 
Eff(H) in a binary logit model: 

$$P_0(1) = \Lambda(\beta_0 x),$$

where $\beta_0$ and x are scalars. 

In this model the determination of the optimal value of $h = H(1)$ is simple 
because we have 

$$E^* w(j)\gamma^2 = \frac{a}{h} + \frac{b}{1 - h}, \tag{9.5.23}$$

where 

$$a = E_x \Lambda(\beta_0 x)\, E_x x^2 \exp(\beta_0 x) [1 + \exp(\beta_0 x)]^{-3} \tag{9.5.24}$$

and 

$$b = E_x [1 - \Lambda(\beta_0 x)]\, E_x x^2 \exp(2\beta_0 x) [1 + \exp(\beta_0 x)]^{-3}. \tag{9.5.25}$$

Because a, b > 0, (9.5.23) approaches $\infty$ as h approaches either 0 or 1 and 
attains a unique minimum at 

$$
h^* =
\begin{cases}
\dfrac{a - \sqrt{ab}}{a - b} & \text{if } a \neq b \\
0.5 & \text{if } a = b.
\end{cases}
\tag{9.5.26}
$$

We assume x is binary with probability distribution: 

    x = 1 with probability p 
      = 0 with probability 1 - p. 

Then, inserting $\beta_0 = \log[(p + 2Q_0 - 1)/(p - 2Q_0 + 1)]$, where $Q_0 = Q_0(1)$, 
into the right-hand side of (9.5.22), Eff(H) becomes a function of p, $Q_0$, and h 
alone. In the last five columns of Table 9.7, the values of Eff(H) are shown for 
various values of p, $Q_0$, and h. For each combination of p and $Q_0$, the value of 
the optimal h, denoted $h^*$, is shown in the third column. For example, if 
p = 0.9 and $Q_0$ = 0.75, the optimal value of h is equal to 0.481. When h is set 
equal to this optimal value, the efficiency of WMLE is 1.387. The table shows 
that the efficiency gain of using choice-based sampling can be considerable 
and that h = 0.5 performs well for all the parameter values considered. It can 
be shown that if $Q_0$ = 0.5, then $h^*$ = 0.5 for all the values of p. 
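
The entries of Table 9.7 can be recomputed from (9.5.22) through (9.5.26). The following sketch does so for the binary-x design just described; the function names are hypothetical, and the closed forms for a, b, and $E^*\gamma^2$ are this example's specialization of (9.5.24) and (9.5.25) to a regressor taking only the values 0 and 1.

```python
import math


def design_quantities(p, q0):
    """Return (a, b, E* gamma^2) for P_0(1) = Lambda(beta0 * x) with binary x.

    With P(x = 1) = p, Q_0 = p * Lambda(beta0) + (1 - p) / 2, which pins down
    Lambda(beta0); only x = 1 contributes to the expectations involving x^2.
    """
    lam = (p + 2.0 * q0 - 1.0) / (2.0 * p)     # Lambda(beta0)
    if not 0.0 < lam < 1.0:
        raise ValueError("(p, Q_0) is not attainable in this model")
    a = q0 * p * lam * (1.0 - lam) ** 2         # (9.5.24)
    b = (1.0 - q0) * p * lam ** 2 * (1.0 - lam)  # (9.5.25)
    info = p * lam * (1.0 - lam)                # E* gamma^2
    return a, b, info


def optimal_h(p, q0):
    """Minimizer of a/h + b/(1 - h); algebraically equal to (9.5.26)."""
    a, b, _ = design_quantities(p, q0)
    return math.sqrt(a) / (math.sqrt(a) + math.sqrt(b))


def efficiency(p, q0, h):
    """Eff(H) of (9.5.22) using (9.5.23)."""
    a, b, info = design_quantities(p, q0)
    return info / (a / h + b / (1.0 - h))
```

For instance, `optimal_h(0.9, 0.75)` and `efficiency(0.9, 0.75, optimal_h(0.9, 0.75))` correspond to the first row of the table, and `efficiency(p, q0, q0)` equals 1 because $h = Q_0$ is random sampling.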

Table 9.7 Efficiency of WMLE for various designs in a binary logit model 

                          Eff(H) at h = 
p     Q_0     h*      h*      0.75    0.5     0.25    0.1 

0.9   0.75    0.481   1.387   1       1.385   1.08    0.531 
0.9   0.25    0.519   1.387   1.08    1.385   1       0.470 
0.9   0.1     0.579   3.548   3.068   3.462   2.25    1 
0.5   0.7     0.337   1.626   0.852   1.471   1.563   1 
0.5   0.3     0.663   1.626   1.563   1.471   0.852   0.36 
0.1   0.525   0.378   1.087   0.625   1.026   1       0.585 
0.1   0.475   0.622   1.087   1       1.026   0.625   0.270 

In the foregoing discussion of WMLE, we have assumed $Q_0(j)$ to be known. 
However, it is more realistic to assume that $Q_0(j)$ needs to be estimated from a 
separate sample and such an estimate is used to define $w(j)$ in WMLE. 
Manski and Lerman did not discuss this problem except to note in a footnote 
that the modified WMLE using an estimate $\hat Q_0(j)$ is consistent as long as 
plim $\hat Q_0(j) = Q_0(j)$. To verify this statement we need merely observe that 

$$
\begin{aligned}
&\left| n^{-1} \sum_{i=1}^{n} [\hat Q_0(j_i) - Q_0(j_i)] H(j_i)^{-1} \log P(j_i|x_i, \beta) \right| \\
&\qquad \leq \max_j |\hat Q_0(j) - Q_0(j)| \cdot n^{-1} \sum_{i=1}^{n} |H(j_i)^{-1} \log P(j_i|x_i, \beta)|.
\end{aligned}
\tag{9.5.27}
$$

The right-hand side of (9.5.27) converges to 0 in probability uniformly in $\beta$ by 
Assumptions 9.5.2 and 9.5.3 and by the consistency of $\hat Q_0(j)$. 

To show the asymptotic equivalence of the modified WMLE and the original 
WMLE, we need an assumption about the rate of convergence of $\hat Q_0(j)$. 
By examining (9.5.12), we see that for asymptotic equivalence we need 

$$\operatorname{plim} n^{-1/2} \sum_{i=1}^{n} [\hat Q_0(j_i) - Q_0(j_i)] H(j_i)^{-1} [\partial \log P(j_i)/\partial\beta]_{\beta_0} = 0. \tag{9.5.28}$$

Therefore we need 

$$\hat Q_0(j) - Q_0(j) = o(n^{-1/2}). \tag{9.5.29}$$

If $\hat Q_0(j)$ is the proportion of people choosing alternative j in a separate sample 
of size $n_1$, 

$$\hat Q_0(j) - Q_0(j) = O(n_1^{-1/2}). \tag{9.5.30}$$

Therefore asymptotic equivalence requires that $n/n_1$ should converge to 0. See 
Hsieh, Manski, and McFadden (1983) for the asymptotic distribution of the 
WMLE with Q estimated in the case where $n/n_1$ converges to a nonzero 
constant. 

An application of WMLE in a model explaining family participation in the 
AFDC program can be found in the article by Hosek (1980). 


9.5.3 Results of Manski and McFadden 


Manski and McFadden (1981) presented a comprehensive summary of all the 
types of models and estimators under choice-based sampling, including the 
results of the other papers discussed elsewhere in Section 9.5. However, we 
shall discuss only those topics that are not dealt with in the other papers, 
namely, the consistency and the asymptotic normality of CBMLE in the cases 
where f is known and Q is either known or unknown and MME in the case 
where f is unknown and Q is known. Because the results presented here are 
straightforward, we shall only sketch the proof of consistency and asymptotic 
normality and shall not spell out all the conditions needed. 

First, consider the case where both f and Q are known. CBMLE maximizes 
$L_{c0}$ given in (9.5.3) subject to the condition 

$$Q_0(j) = \int P(j|x, \beta) f(x)\,dx, \quad j = 1, 2, \ldots, m. \tag{9.5.31}$$

The condition corresponding to j = 0 is redundant because 
$\sum_{j=0}^{m} P(j|x, \beta) = 1$, which will be implicitly observed throughout the following 
analysis. Ignoring the known components of $L_{c0}$, CBMLE is defined as 
maximizing $\sum_{i=1}^{n} \log P(j_i|x_i, \beta)$ with respect to $\beta$ subject to (9.5.31). Because the 
above maximand is the essential part of log L* under random sampling, we 
have CBMLE = RSMLE in this case. However, the properties of the estima- 
tor must be derived under the assumption of choice-based sampling and are 
therefore different from the results of Section 9.3. 

To prove consistency, we need merely note that consistency of the uncon- 
strained CBMLE follows from the general result of Section 4.2.2 and that the 
probability limit of the constrained CBMLE should be the same as that of the 
unconstrained CBMLE if the constraint is true. 

To prove asymptotic normality, it is convenient to rewrite the constraint 
(9.5.31) in the form of 


$$\beta = g(\alpha), \tag{9.5.32}$$

where $\alpha$ is a (k - m)-vector, as we did in Section 4.5.1. Then, by a 
straightforward application of Theorem 4.2.4, we obtain 

$$\sqrt{n}(\hat\alpha_{ML} - \alpha_0) \to N[0, (G'E\gamma\gamma'G)^{-1}], \tag{9.5.33}$$

where $G = [\partial\beta/\partial\alpha']_{\alpha_0}$ and $\gamma = [\partial \log P(j)/\partial\beta]_{\beta_0}$. Therefore, using a Taylor 
series approximation, 

$$\hat\beta_{ML} - \beta_0 \cong G(\hat\alpha_{ML} - \alpha_0), \tag{9.5.34}$$

we obtain 

$$\sqrt{n}(\hat\beta_{ML} - \beta_0) \to N[0, G(G'E\gamma\gamma'G)^{-1}G']. \tag{9.5.35}$$

As we would expect, $\hat\beta_{ML}$ is asymptotically more efficient than WMLE $\hat\beta_w$. In 
other words, we should have 

$$G(G'E\gamma\gamma'G)^{-1}G' \leq (Ew\gamma\gamma')^{-1} Ew^2\gamma\gamma' (Ew\gamma\gamma')^{-1}, \tag{9.5.36}$$

for any w and G. This inequality follows straightforwardly from (9.5.19). 

Second, consider the case where f is known and Q is unknown. Here, 
CBMLE is defined as maximizing (9.5.2) without constraint. The estimator is 
consistent by the result of Section 4.2.2, and its asymptotic normality follows 
from Theorem 4.2.4. Thus 

$$\sqrt{n}(\tilde\beta_{ML} - \beta_0) \to N[0, (E\gamma\gamma' - E\delta\delta')^{-1}], \tag{9.5.37}$$

where $\delta = [\partial \log Q/\partial\beta]_{\beta_0}$. As we would expect, $\tilde\beta_{ML}$ is not as efficient as $\hat\beta_{ML}$ 
because 

$$G(G'E\gamma\gamma'G)^{-1}G' \leq (E\gamma\gamma')^{-1} \leq (E\gamma\gamma' - E\delta\delta')^{-1}. \tag{9.5.38}$$


Finally, consider the case where f is unknown and Q is known. We shall 
discuss CBMLE for this model in Section 9.5.5. Here, we shall consider the 
Manski-McFadden estimator (MME), which maximizes 

$$\Psi = \prod_{i=1}^{n} \frac{P(j_i|x_i, \beta) Q_0(j_i)^{-1} H(j_i)}{\sum_{j=0}^{m} P(j|x_i, \beta) Q_0(j)^{-1} H(j)}. \tag{9.5.39}$$

The motivation for this estimator is the following: As we can see from (9.5.3), 
the joint probability of j and x under the present assumption is 

$$h(j, x) \equiv P(j|x, \beta) f(x) Q_0(j)^{-1} H(j). \tag{9.5.40}$$

Therefore the conditional probability of j given x is 

$$h(j|x) = \frac{h(j, x)}{\sum_{j=0}^{m} h(j, x)}, \tag{9.5.41}$$

which leads to the conditional likelihood function (9.5.39). The estimator is 
computationally attractive because the right-hand side of (9.5.39) does not 
depend on f(x), which is assumed unknown and requires a nonstandard 
analysis of estimation, as we shall see in Sections 9.5.4 and 9.5.5. 
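
To make (9.5.39) through (9.5.41) concrete, here is a minimal sketch of the conditional probability $h(j|x)$ and of $\log \Psi$. The multinomial logit form of $P(j|x, \beta)$, with alternative 0 as the base, is an assumption of this example (the text leaves P general), and all function names are hypothetical.

```python
import math


def mnl_probs(x, betas):
    """P(j | x, beta), j = 0..m, for a multinomial logit with base alternative 0.

    `betas` holds one coefficient vector for each of the alternatives 1..m.
    """
    index = [0.0] + [sum(c * v for c, v in zip(b, x)) for b in betas]
    top = max(index)                       # subtract the max for numerical stability
    expo = [math.exp(v - top) for v in index]
    denom = sum(expo)
    return [e / denom for e in expo]


def mme_conditional_probs(x, betas, q0, h):
    """h(j | x) of (9.5.41): P(j|x) H(j) / Q_0(j), renormalized over j.

    f(x) cancels between numerator and denominator, so it is never needed.
    """
    p = mnl_probs(x, betas)
    terms = [pj * hj / qj for pj, qj, hj in zip(p, q0, h)]
    denom = sum(terms)
    return [t / denom for t in terms]


def mme_log_likelihood(sample, betas, q0, h):
    """log Psi of (9.5.39) for a sample of (j_i, x_i) pairs."""
    return sum(math.log(mme_conditional_probs(x, betas, q0, h)[j])
               for j, x in sample)
```

In the binary case with an intercept, $h(1|x)$ is again a logit whose intercept is shifted by $\log[H(1)/Q_0(1)] - \log[H(0)/Q_0(0)]$, which is one way to see why the criterion is easy to maximize with standard logit software once $Q_0$ and H are known.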

To prove the consistency of the estimator, we observe that 


$$
\begin{aligned}
\operatorname{plim}_{n\to\infty} n^{-1} \log \Psi &= E \log \frac{P(j) Q_0(j)^{-1} H(j)}{\sum_{j=0}^{m} P(j) Q_0(j)^{-1} H(j)} \\
&= \int [E^* \log h(j|x)]\, \zeta(x)\,dx,
\end{aligned}
\tag{9.5.42}
$$

where $\zeta(x) = \sum_{j=0}^{m} P_0(j) Q_0(j)^{-1} H(j) f(x)$ and $E^*$ is the expectation taken with 
respect to the true conditional probability $h_0(j|x)$. Equation (9.5.42) is 
maximized at $\beta_0$ because $\zeta(x) > 0$ and 

$$E^* \log h(j|x) < E^* \log h_0(j|x), \tag{9.5.43}$$

which, like (9.5.8), is a consequence of Jensen's inequality (4.2.6). 

By a straightforward application of Theorem 4.1.3, we can show 

$$\sqrt{n}(\hat\beta_{MM} - \beta_0) \to N[0, (E\gamma\gamma' - E\epsilon\epsilon')^{-1}], \tag{9.5.44}$$

where $\epsilon = [\partial \log \sum_{j=0}^{m} P(j) Q_0(j)^{-1} H(j)/\partial\beta]_{\beta_0}$. The asymptotic covariance 
matrix in (9.5.44) is neither larger nor smaller (in matrix sense) than the 
asymptotic covariance matrix of WMLE given in (9.5.17). 


9.5.4 Results of Cosslett: Part I 


Cosslett (1981a) proved the consistency and the asymptotic normality of 
CBMLE in the model where both f and Q are unknown and also proved that 
CBMLE asymptotically attains the Cramér-Rao lower bound. These results 
require much ingenuity and deep analysis because maximizing a likelihood 
function with respect to a density function f as well as parameters $\beta$ creates a 
new and difficult problem that cannot be handled by the standard asymptotic 
theory of MLE. As Cosslett noted, his model does not even satisfy the condi- 
tions of Kiefer and Wolfowitz (1956) for consistency of MLE in the presence 
of infinitely many nuisance parameters. 

Cosslett's sampling scheme is a generalization of the choice-based sampling 
we have hitherto considered (note 15). His scheme, called generalized choice-based 
sampling, is defined as follows: Assume that the total sample of size n is 
divided into S subsamples, with $n_s$ people ($n_s$ is a fixed known number) in the 
sth subsample. A person in the sth subsample faces alternatives $J_s$, a subset of 
the total alternatives (0, 1, 2, . . . , m). He or she chooses alternative j with 
probability $Q(j)/\tilde Q(s)$, where $\tilde Q(s) = \sum_{j \in J_s} Q(j)$. Given j, he or she chooses a 
vector of exogenous variables x with a conditional density $f(x|j)$ (note 16). Therefore 
the contribution of this typical person to the likelihood function is 
$\tilde Q(s)^{-1} Q(j) f(x|j)$, which can be equivalently expressed as 

$$\tilde Q(s)^{-1} P(j|x, \beta) f(x). \tag{9.5.45}$$

Taking the product of (9.5.45) over all the persons in the sth subsample 
(denoted by $I_s$) and then over all s, we obtain the likelihood function 

$$
\begin{aligned}
L &= \prod_{s=1}^{S} \prod_{i \in I_s} \tilde Q(s)^{-1} P(j_i|x_i, \beta) f(x_i) \\
&= \prod_{s=1}^{S} \tilde Q(s)^{-n_s} \prod_{i=1}^{n} P(j_i|x_i, \beta) f(x_i).
\end{aligned}
\tag{9.5.46}
$$

In the model under consideration, the $Q(j)$'s are unknown. Therefore we 
may wonder how, in subsample s, alternatives in $J_s$ are sampled in such a way 
that the jth alternative is chosen with probability $\tilde Q(s)^{-1} Q(j)$. To this question 
Cosslett (1981b) gave an example of interviewing train riders at a train station, 
some of whom have traveled to the station by their own cars and some of 
whom have come by taxi. Thus this subsample consists of two alternatives, 
each of which is sampled according to the correct probability by random 
sampling conducted at the train station. 

Cosslett’s generalization of choice-based sampling is an attractive one, not 
only because it contains the simple choice-based sampling of the preceding 
sections as a special case (take $J_s = (s)$), but also because it contains an 
interesting special case of “enriched” samples (note 17). That is, a researcher conducts 
random sampling, and if a particular alternative is chosen with a very low 
frequency, the sample is enriched by interviewing more of those who chose 
this particular alternative. Here we have, for example, $J_1 = (0, 1, 2, \ldots, m)$ 
and $J_2 = (0)$, if the 0th alternative is the one infrequently chosen. 

Our presentation of Cosslett’s results will not be rigorous, and we shall not 
spell out all the conditions needed for consistency, asymptotic normality, and 
asymptotic efficiency. However, we want to mention the following identifica- 
tion conditions as they are especially important. 


ASSUMPTION 9.5.6. $\bigcup_{s=1}^{S} J_s = (0, 1, 2, \ldots, m)$. 


ASSUMPTION 9.5.7. A subset M of integers (1, 2, . . . , S) such that 
$(\bigcup_{s \in M} J_s) \cap (\bigcup_{s \in \bar M} J_s) = \emptyset$, where $\bar M$ is the complement of M, cannot be 
found. 


Note that if S = 2 and m = 1, for example, $J_1 = (0)$ and $J_2 = (1)$ violate 
Assumption 9.5.7, whereas $J_1 = (0)$ and $J_2 = (0, 1)$ satisfy it. Thus simple 
choice-based sampling violates this assumption. Cosslett noted that these 
assumptions are needed for the multinomial logit model but may not be 
needed for other QR models. Cosslett also gave an intuitive reason for these 
assumptions: If alternatives j and k are both contained in some subsample, 
then $Q(j)/Q(k)$ can be estimated consistently, and under Assumption 9.5.7 
we have enough of these ratios to estimate all the $Q(j)$ separately. Thus 
Assumption 9.5.7 would not be needed in the case where $Q(j)$ are known. 
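
The two identification conditions are easy to check mechanically for a proposed design. The sketch below is one possible reading, with hypothetical function names: Assumption 9.5.6 is a coverage check, and Assumption 9.5.7 is treated as connectedness of the graph whose nodes are subsamples and whose edges join subsamples sharing an alternative. The subset M is taken to be nonempty and proper, which is an interpretation rather than something the text states.

```python
def covers_all_alternatives(subsamples, m):
    """Assumption 9.5.6: the union of the J_s equals (0, 1, ..., m)."""
    return set().union(*subsamples) == set(range(m + 1))


def is_inseparable(subsamples):
    """Assumption 9.5.7: no split of the subsamples into two groups whose
    sets of alternatives are disjoint.

    Equivalent to connectedness of the overlap graph: start from subsample 0
    and keep adding any subsample sharing an alternative with one reached.
    """
    sets = [set(s) for s in subsamples]
    if not sets:
        return False
    reached = {0}
    stack = [0]
    while stack:
        s = stack.pop()
        for t in range(len(sets)):
            if t not in reached and sets[s] & sets[t]:
                reached.add(t)
                stack.append(t)
    return len(reached) == len(sets)
```

With the examples from the text, `is_inseparable([{0}, {1}])` is false and `is_inseparable([{0}, {0, 1}])` is true, so an enriched sample passes while simple choice-based sampling does not.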

Before embarking on a discussion of Cosslett's results, we must list a few 
more symbols as an addition to those listed in Section 9.5.1: 


$\tilde P(s) = \tilde P(s|x, \beta) = \sum_{j \in J_s} P(j|x, \beta)$ 
$\tilde P_0(s) = \tilde P(s|x, \beta_0) = \sum_{j \in J_s} P(j|x, \beta_0)$ 
$\tilde Q(s) = \sum_{j \in J_s} Q(j)$ 
$\tilde Q_0(s) = \sum_{j \in J_s} Q_0(j)$ 
$H(s) = n_s/n$ 

From (9.5.46) we obtain 

$$\log L = \sum_{i=1}^{n} \log P(j_i|x_i, \beta) + \sum_{i=1}^{n} \log f(x_i) - \sum_{s=1}^{S} n_s \log \int \tilde P(s|x, \beta) f(x)\,dx. \tag{9.5.47}$$


Our aim is to maximize (9.5.47) with respect to a parameter vector $\beta$ and a 
function $f(\cdot)$. By observing (9.5.47) we can immediately see that we should 
put $f(x) = 0$ for $x \neq x_i$, $i = 1, 2, \ldots, n$. Therefore our problem is reduced to 
the more manageable one of maximizing 

$$\log L_1 = \sum_{i=1}^{n} \log P(j_i|x_i, \beta) + \sum_{i=1}^{n} \log w_i - \sum_{s=1}^{S} n_s \log \sum_{i=1}^{n} w_i \tilde P(s|x_i, \beta) \tag{9.5.48}$$

with respect to $\beta$ and $w_i$, $i = 1, 2, \ldots, n$. Differentiating (9.5.48) with 
respect to $w_i$ and setting the derivative equal to 0 yield 

$$\frac{\partial \log L_1}{\partial w_i} = \frac{1}{w_i} - \sum_{s=1}^{S} \frac{n_s \tilde P(s|x_i, \beta)}{\sum_{i'=1}^{n} w_{i'} \tilde P(s|x_{i'}, \beta)} = 0. \tag{9.5.49}$$

If we insert (9.5.49) into (9.5.48), we recognize that $\log L_1$ depends on $w_i$ 
only through the S terms $\sum_{i=1}^{n} w_i \tilde P(s|x_i, \beta)$, $s = 1, 2, \ldots, S$. Thus, if we 
define 

$$\lambda_s(\beta) = \frac{H(s)}{\sum_{i=1}^{n} w_i \tilde P(s|x_i, \beta)}, \tag{9.5.50}$$

we can, after some manipulation, write $\log L_1$ as a function of $\beta$ and $\{\lambda_s(\beta)\}$ as 

$$
\begin{aligned}
\log L_1 &= \sum_{i=1}^{n} \log P(j_i|x_i, \beta) - \sum_{i=1}^{n} \log \sum_{s=1}^{S} \lambda_s(\beta) \tilde P(s|x_i, \beta) \\
&\quad + \sum_{s=1}^{S} n_s \log \lambda_s(\beta) - \sum_{s=1}^{S} n_s \log n_s.
\end{aligned}
\tag{9.5.51}
$$

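But maximizing (9.5.51) with respect to $\beta$ and $\lambda_s(\beta)$, $s = 1, 2, \ldots, S$, is 
equivalent to maximizing (note 18) 

$$\Omega = \sum_{i=1}^{n} \log P(j_i|x_i, \beta) - \sum_{i=1}^{n} \log \sum_{s=1}^{S} \lambda_s \tilde P(s|x_i, \beta) + \sum_{s=1}^{S} n_s \log \lambda_s \tag{9.5.52}$$

with respect to $\beta$ and $\lambda_s$, $s = 1, 2, \ldots, S$. This can be shown as follows. 
Inserting (9.5.50) into (9.5.49) yields 

$$w_i = \frac{1}{n \sum_{s=1}^{S} \lambda_s(\beta) \tilde P(s|x_i, \beta)}. \tag{9.5.53}$$

Multiplying both sides of (9.5.53) by $\tilde P(s|x_i, \beta)$, summing over i, and using 
(9.5.50) yield 

$$\frac{n_s}{\lambda_s(\beta)} = \sum_{i=1}^{n} \frac{\tilde P(s|x_i, \beta)}{\sum_{s'=1}^{S} \lambda_{s'}(\beta) \tilde P(s'|x_i, \beta)}. \tag{9.5.54}$$

But setting the derivative of $\Omega$ with respect to $\lambda_s$ equal to 0 yields 

$$\frac{\partial \Omega}{\partial \lambda_s} = \frac{n_s}{\lambda_s} - \sum_{i=1}^{n} \frac{\tilde P(s|x_i, \beta)}{\sum_{s'=1}^{S} \lambda_{s'} \tilde P(s'|x_i, \beta)} = 0. \tag{9.5.55}$$

Clearly, a solution of (9.5.55) is a solution of (9.5.54). 

In the maximization of (9.5.52) with respect to $\beta$ and $\{\lambda_s\}$, some 
normalization is necessary because $\Omega(\alpha\lambda_1, \alpha\lambda_2, \ldots, \alpha\lambda_S, \beta) = 
\Omega(\lambda_1, \lambda_2, \ldots, \lambda_S, \beta)$. The particular normalization adopted by Cosslett is 

$$\lambda_S = H(S). \tag{9.5.56}$$

Thus we define CBMLE of this model as the value of $\beta$ that is obtained by 
maximizing $\Omega$ with respect to $\beta$ and $\{\lambda_s\}$ subject to the normalization 
(9.5.56). 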
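
The criterion (9.5.52) and the reason a normalization such as (9.5.56) is required can be illustrated with a short sketch. Everything here beyond the formula itself is an assumption of the example: the data layout (each observation records its subsample index, chosen alternative, and x), the zero-based subsample indices, and the binary logit helper used to supply $P(j|x, \beta)$.

```python
import math


def make_binary_logit(b0, b1):
    """Hypothetical helper: returns prob(j, x) = P(j | x, beta) for j in {0, 1}."""
    def prob(j, x):
        p1 = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        return p1 if j == 1 else 1.0 - p1
    return prob


def omega(sample, subsets, lambdas, prob):
    """Criterion (9.5.52).

    sample  : list of (s, j, x) with s the zero-based subsample index
    subsets : list of the alternative sets J_s
    lambdas : list of lambda_s, one per subsample
    prob    : function (j, x) -> P(j | x, beta)
    """
    counts = [0] * len(subsets)            # n_s
    total = 0.0
    for s, j, x in sample:
        counts[s] += 1
        # P-tilde(s' | x, beta) for every subsample s'
        p_tilde = [sum(prob(k, x) for k in alts) for alts in subsets]
        mix = sum(lam * pt for lam, pt in zip(lambdas, p_tilde))
        total += math.log(prob(j, x)) - math.log(mix)
    total += sum(n_s * math.log(lam) for n_s, lam in zip(counts, lambdas))
    return total
```

Multiplying every $\lambda_s$ by the same positive constant leaves the value of `omega` unchanged, because the second sum falls by $n \log \alpha$ and the third rises by the same amount. Fixing $\lambda_S = H(S)$ removes that indeterminacy before a numerical maximizer is applied.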

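Although our aim is to estimate $\beta$, it is of some interest to ask what the MLE
of $\{\lambda_s\}$ is supposed to estimate. The probability limits of MLE $\hat\beta$ and $\{\hat\lambda_s\}$ can be
obtained as the values that maximize $\operatorname{plim}_{n\to\infty} n^{-1}\Omega$. We have

$$\operatorname{plim}_{n\to\infty} n^{-1}\Omega = E \log P(j_i|x_i, \beta) - E \log \sum_{s=1}^{S} \lambda_s P(s|x_i, \beta) + \sum_{s=1}^{S} H(s) \log \lambda_s \tag{9.5.57}$$

$$= \sum_{s=1}^{S} H(s) \int [\log P(j|x, \beta)]\, Q_0(s)^{-1} P(j|x, \beta_0) f(x)\, dx$$

$$\quad - \sum_{s=1}^{S} H(s) \int \left[\log \sum_{s'=1}^{S} \lambda_{s'} P(s'|x, \beta)\right] Q_0(s)^{-1} P(s|x, \beta_0) f(x)\, dx$$

$$\quad + \sum_{s=1}^{S} H(s) \log \lambda_s.$$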


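Differentiating (9.5.57) with respect to $\lambda_t$, $t = 1, 2, \ldots, S - 1$, and $\beta$ yields

$$\frac{\partial}{\partial \lambda_t} \operatorname{plim} n^{-1}\Omega = -\int \frac{\sum_{s=1}^{S} Q_0(s)^{-1} H(s) P(s|x, \beta_0)}{\sum_{s=1}^{S} \lambda_s P(s|x, \beta)}\, P(t|x, \beta) f(x)\, dx + \frac{H(t)}{\lambda_t}, \qquad t = 1, 2, \ldots, S - 1, \tag{9.5.58}$$

and

$$\frac{\partial}{\partial \beta} \operatorname{plim} n^{-1}\Omega = \sum_{s=1}^{S} H(s) \int \frac{1}{P(j|x, \beta)} \frac{\partial P(j|x, \beta)}{\partial \beta}\, Q_0(s)^{-1} P(j|x, \beta_0) f(x)\, dx \tag{9.5.59}$$

$$\quad - \sum_{s=1}^{S} H(s) \int \frac{1}{\sum_{s'=1}^{S} \lambda_{s'} P(s'|x, \beta)} \sum_{s'=1}^{S} \lambda_{s'} \frac{\partial P(s'|x, \beta)}{\partial \beta}\, Q_0(s)^{-1} P(s|x, \beta_0) f(x)\, dx.$$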
By studying (9.5.58) we see that the right-hand side vanishes if we put $\lambda_s =
Q_0(s)^{-1} H(s) Q_0(S)$ and $\beta = \beta_0$. Next, insert these same values into the
right-hand side of (9.5.59). Then each term becomes


X | z FE nonem 


= 22 2 P PG G)OSG)"!f(x) dx = 5 1=0. 


We conclude that

$$\operatorname{plim} \hat\beta = \beta_0 \tag{9.5.60}$$

and

$$\operatorname{plim} \hat\lambda_s = \frac{H(s) Q_0(S)}{Q_0(s)}. \tag{9.5.61}$$

It is interesting to note that if we replace $\lambda_s$ that appears in the right-hand
side of (9.5.52) by $\operatorname{plim} \hat\lambda_s$ given in (9.5.61), $\Omega$ becomes identical to the
logarithm of $\Psi$ given in (9.5.39). This provides an alternative interpretation of the
Manski-McFadden estimator.

The asymptotic normality of $\hat\beta$ and $\{\hat\lambda_s\}$ can be proved by a straightforward,
although cumbersome, application of Theorem 4.1.3. Although the same
value of $\beta$ maximizes $\log L$ given in (9.5.47) and $\Omega$ given in (9.5.52), $\Omega$ is,
strictly speaking, not a log likelihood function. Therefore we must use
Theorem 4.1.3, which gives the asymptotic normality of a general extremum
estimator, rather than Theorem 4.2.4, which gives the asymptotic normality
of MLE. Indeed, Cosslett showed that the asymptotic covariance matrix of
$\hat\alpha \equiv (\hat\beta', \hat\lambda_1, \hat\lambda_2, \ldots, \hat\lambda_{S-1})'$ is given by a formula like that given in Theorem
4.1.3, which takes the form $A^{-1}BA^{-1}$, rather than by $[-E\,\partial^2\Omega/\partial\alpha\,\partial\alpha']^{-1}$.
However, Cosslett showed that the asymptotic covariance matrix of $\hat\beta$ is equal
to the first $K \times K$ block of $[-E\,\partial^2\Omega/\partial\alpha\,\partial\alpha']^{-1}$.

Cosslett showed that the asymptotic covariance matrix of $\hat\beta$ attains the lower
bound of the covariance matrix of any unbiased estimator of $\beta$. This remarkable
result is obtained by generalizing the Cramér-Rao lower bound (Theorem
1.3.1) to a likelihood function that depends on an unknown density function
as in (9.5.47).

To present the gist of Cosslett's argument, we must first define the concept
of differentiation of a functional with respect to a function. Let $F(f)$ be a
mapping from a function to a real number (such a mapping is called a
functional). Then we define

$$\left.\frac{\partial F}{\partial f}\right|_{\zeta} = \lim_{\epsilon \to 0} \frac{F(f + \epsilon\zeta) - F(f)}{\epsilon}, \tag{9.5.62}$$

where $\zeta$ is a function for which $F(f + \epsilon\zeta)$ can be defined. Let $t_1$ (a $K$-vector)
and $t_2$ (a scalar) be unbiased estimators of $\beta$ and $\int f(x)\zeta(x)\, dx$ for some function
$\zeta$ such that $\int \zeta(x)\, dx = 0$ and $\int \zeta(x)^2\, dx = 1$. Then Cosslett showed that the
$(2K + 2)$-vector

$$\left[ t_1',\; t_2,\; \frac{\partial \log L}{\partial \beta'},\; \left.\frac{\partial \log L}{\partial f}\right|_{\zeta} \right]'$$

has covariance matrix of the form

$$\begin{bmatrix} C & I \\ I & R(\zeta) \end{bmatrix},$$

where $C$ is the covariance matrix of $(t_1', t_2)'$. Because this covariance matrix is
positive definite, we have $C > R(\zeta)^{-1}$, as in Theorem 1.3.1. Finally, it is shown
that the asymptotic covariance matrix of $\hat\beta$ is equal to the first $K \times K$ block of
$\max_{\zeta} R(\zeta)^{-1}$.

Thus Cosslett seems justified in saying that $\hat\beta$ is asymptotically efficient in
the sense defined in Section 4.2.4. As we remarked in that section, this does
not mean that $\hat\beta$ has the smallest asymptotic covariance matrix among all
consistent estimators. Whether the results of LeCam and Rao mentioned in
Section 4.2.4 also apply to Cosslett's model remains to be shown.


9.5.5 Results of Cosslett: Part II


Cosslett (1981b) summarized results obtained elsewhere, especially from his 
earlier papers (Cosslett, 1978, 1981a). He also included a numerical evalua- 
tion of the asymptotic bias and variance of various estimators. We shall first 
discuss CBMLE of the generalized choice-based sample model with unknown
f and known Q. Cosslett (1981b) merely stated the consistency, asymptotic 
normality, and asymptotic efficiency of the estimator, which are proved in 
Cosslett (1978). The discussion here will be brief because the results are analo- 
gous to those given in the previous subsection. 

The log likelihood function we shall consider in this subsection is similar to
(9.5.47) except that the last term is simplified because $Q$ is now known. Thus
we have

$$\log L_2 = \sum_{i=1}^{n} \log P(j_i|x_i, \beta) + \sum_{i=1}^{n} \log f(x_i) - \sum_{s=1}^{S} n_s \log Q_0(s), \tag{9.5.63}$$

which is to be maximized with respect to $\beta$ and $f$ subject to the constraints

$$Q_0(s) = \int P(s|x, \beta) f(x)\, dx. \tag{9.5.64}$$

It is shown that this constrained maximization is equivalent to maximizing
with respect to $\lambda_j$, $j = 0, 1, 2, \ldots, m$,

$$\Omega_1 = \sum_{i=1}^{n} \log \frac{P(j_i|x_i, \beta)}{\sum_{j=0}^{m} \lambda_j P(j|x_i, \beta)} \tag{9.5.65}$$

subject to the constraint $\sum_{j=0}^{m} \lambda_j Q_0(j) = 1$. Consistency, asymptotic normality,
and asymptotic efficiency can be proved in much the same way as in the
preceding subsection.

Next, we shall report Cosslett's numerical analysis, which is in the same
spirit as that reported in Section 9.5.2 concerning the Manski-Lerman
estimator. Cosslett compared RSMLE, CBMLE, WMLE, and MME in the simple
choice-based sample model with $f$ unknown and $Q$ known. Three binary
QR models (logit, probit, and arctangent) were considered. In each model
there is only one independent variable, which is assumed to be normally
distributed. The asymptotic bias and the asymptotic variance of the
estimators are evaluated for different values of $\beta$ (two coefficients), $Q(1)$, $H(1)$, and
the mean of the independent variable. The optimal design is also derived for
each estimator. Cosslett concluded that (1) RSMLE can have a large
asymptotic bias; (2) CBMLE is superior to WMLE and MME; (3) a comparison
between WMLE and MME depends on parameters, especially $Q(1)$; (4) the
choice of $H(1) = 0.5$ was generally quite good, leading to only a small loss of
efficiency compared to the optimal design. (This last finding is consistent with
the results of our numerical analysis reported in Section 9.5.2.)
9.6 Distribution-Free Methods


In this section we shall discuss two important articles by Manski (1975) and 
Cosslett (1983). Both articles are concerned with the distribution-free estima- 
tion of parameters in QR models—Manski for multinomial models and 
Cosslett for binary models. However, their approaches are different: Manski 
proposed a nonparametric method called maximum score estimation whereas 
Cosslett proposed generalized maximum likelihood estimation in which the 
likelihood function is maximized with respect to both $F$ and $\beta$ in the model
(9.2.1). Both authors proved only consistency of their estimator but not 
asymptotic normality. 


9.6.1 Maximum Score Estimator — A Binary Case 


Manski (1975) considered a multinomial QR model, but here we shall define
his estimator for a binary QR model and shall prove its consistency. Our proof
will be different from Manski's proof. We shall then indicate how to extend
the proof to the case of a multinomial QR model in the next subsection.
Consider a binary QR model

$$P(y_i = 1) = F(x_i'\beta_0), \qquad i = 1, 2, \ldots, n, \tag{9.6.1}$$

and define the score function

$$S_n(\beta) = \sum_{i=1}^{n} \left[ y_i \chi(x_i'\beta \ge 0) + (1 - y_i) \chi(x_i'\beta < 0) \right], \tag{9.6.2}$$

where

$$\chi(E) = \begin{cases} 1 & \text{if event } E \text{ occurs} \\ 0 & \text{otherwise.} \end{cases} \tag{9.6.3}$$

Note that the score is the number of correct predictions we would make if we
predicted $y_i$ to be 1 whenever $x_i'\beta \ge 0$. Manski's maximum score estimator $\hat\beta_n$
is defined by

$$S_n(\hat\beta_n) = \sup_{\beta \in B} S_n(\beta), \tag{9.6.4}$$

where the parameter space $B$ is taken as

$$B = \{\beta \mid \beta'\beta = 1\}. \tag{9.6.5}$$

Clearly, (9.6.5) implies no loss of generality because $S_n(c\beta) = S_n(\beta)$ for any
positive scalar $c$.
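
As a rough illustration of (9.6.2) and (9.6.4), the sketch below counts correct predictions and searches the unit circle on a grid. It is not Manski's computational method: the grid search, the restriction to two regressors, and the names `score` and `max_score_grid` are assumptions made for this example only.

```python
import math

def score(beta, xs, ys):
    """S_n(beta) of (9.6.2): correct predictions when y is predicted 1 iff x'beta >= 0."""
    correct = 0
    for x, y in zip(xs, ys):
        predicted = 1 if sum(b * v for b, v in zip(beta, x)) >= 0 else 0
        correct += 1 if predicted == y else 0
    return correct

def max_score_grid(xs, ys, steps=360):
    """Crude grid search over {beta : beta'beta = 1} for K = 2 (illustrative only)."""
    best_score, best_beta = -1, None
    for k in range(steps):
        angle = 2.0 * math.pi * k / steps
        beta = (math.cos(angle), math.sin(angle))
        s = score(beta, xs, ys)
        if s > best_score:
            best_score, best_beta = s, beta
    return best_beta, best_score

# Made-up data: a constant and one regressor.
xs = [(1.0, 2.0), (1.0, -1.0), (1.0, 0.5), (1.0, -3.0)]
ys = [1, 0, 1, 0]
beta_hat, s_hat = max_score_grid(xs, ys)
```

Because the score is a step function of $\beta$, the maximizer is generally a set rather than a point; the grid search simply returns the first grid point that attains the highest score.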
Because $S_n(\beta)$ is not continuous in $\beta$, we cannot use Theorem 4.1.1 without
a modification. However, an appropriate modification is possible by
generalizing the concept of convergence in probability as follows:


DEFINITION 9.6.1. Let $(\Omega, A, P)$ be a probability space. A sequence of not
necessarily measurable functions $g_T(\omega)$ for $\omega \in \Omega$ is said to converge to 0 in
probability in the generalized sense if for any $\epsilon > 0$ there exists $A_T \in A$ such
that

$$A_T \subseteq \{\omega \mid |g_T(\omega)| < \epsilon\}$$

and $\lim_{T\to\infty} P(A_T) = 1$.
Using this definition, we can modify Theorem 4.1.1 as follows: 


THEOREM 9.6.1. Make the following assumptions: 

(A) The parameter space $\Theta$ is a compact subset of the Euclidean $K$-space
($R^K$).

(B) $Q_T(y, \theta)$ is a measurable function of $y$ for all $\theta \in \Theta$.

(C) $T^{-1}Q_T(\theta)$ converges to a nonstochastic continuous function $Q(\theta)$ in
probability uniformly in $\theta \in \Theta$ as $T$ goes to $\infty$, and $Q(\theta)$ attains a unique global
maximum at $\theta_0$.

Define $\hat\theta_T$ as a value that satisfies

$$Q_T(\hat\theta_T) = \sup_{\theta \in \Theta} Q_T(\theta). \tag{9.6.6}$$

Then $\hat\theta_T$ converges to $\theta_0$ in probability in the generalized sense.


We shall now prove the consistency of the maximum score estimator with 
the convergence understood to be in the sense of Definition 9.6.1. 


THEOREM 9.6.2. Assume the following: 

(A) F is a distribution function such that F(x) = 0.5 if and only if x = 0. 

(B) $\{x_i\}$ are i.i.d. vector random variables with a joint density function $g(x)$
such that $g(x) > 0$ for all $x$.

Then any sequence $\{\hat\beta_n\}$ satisfying (9.6.4) converges to $\beta_0$ in probability in
the generalized sense.


Proof. Let us verify the assumptions of Theorem 9.6.1. Assumptions A and 
B are clearly satisfied. The verification of assumption C will be done in five 
steps. 
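First, define for $\lambda > 0$

$$S_{\lambda n}(\beta) = \sum_{i=1}^{n} \left[ y_i \psi_\lambda(x_i'\beta) + (1 - y_i) \psi_\lambda(-x_i'\beta) \right], \tag{9.6.7}$$

where

$$\psi_\lambda(x) = \begin{cases} 0 & \text{if } x \le 0 \\ \lambda x & \text{if } 0 < x < \lambda^{-1} \\ 1 & \text{if } \lambda^{-1} \le x. \end{cases} \tag{9.6.8}$$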

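Because each term of the summation in (9.6.7) minus its expected value
satisfies all the conditions of Theorem 4.2.1 for a fixed positive $\lambda$, we can
conclude that for any $\epsilon, \delta > 0$ there exists $n_1(\lambda)$, which may depend on $\lambda$, such
that for all $n \ge n_1(\lambda)$

$$P\left[ \sup_{\beta \in B} \left| n^{-1} S_{\lambda n}(\beta) - Q_\lambda(\beta) \right| > \frac{\epsilon}{3} \right] < \frac{\delta}{2}, \tag{9.6.9}$$

where

$$Q_\lambda(\beta) = E F(x'\beta_0) \psi_\lambda(x'\beta) + E[1 - F(x'\beta_0)] \psi_\lambda(-x'\beta). \tag{9.6.10}$$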


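Second, we have

$$\sup_{\beta \in B} \left| n^{-1} S_n(\beta) - n^{-1} S_{\lambda n}(\beta) \right| \le \sup_{\beta \in B} \left| n^{-1} \sum_{i=1}^{n} \eta_\lambda(x_i'\beta) - E\eta_\lambda(x'\beta) \right| + \sup_{\beta \in B} E\eta_\lambda(x'\beta) \equiv A_1 + A_2, \tag{9.6.11}$$

where

$$\eta_\lambda(x) = \begin{cases} 0 & \text{if } \lambda^{-1} \le |x| \\ 1 + \lambda x & \text{if } -\lambda^{-1} < x < 0 \\ 1 - \lambda x & \text{if } 0 \le x < \lambda^{-1}. \end{cases} \tag{9.6.12}$$

Applying Theorem 4.2.1 to $A_1$, we conclude that for any $\epsilon, \delta > 0$ there exists
$n_2(\lambda)$, which may depend on $\lambda$, such that for all $n \ge n_2(\lambda)$

$$P\left( A_1 > \frac{\epsilon}{6} \right) < \frac{\delta}{2}. \tag{9.6.13}$$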
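We have

$$A_2 \le \sup_{\beta \in B} P[(x'\beta)^2 < \lambda^{-2}]. \tag{9.6.14}$$

But, because the right-hand side of (9.6.14) converges to 0 as $\lambda \to \infty$ because of
assumption B, we have for all $\lambda \ge \lambda_1$

$$P\left( A_2 > \frac{\epsilon}{6} \right) = 0. \tag{9.6.15}$$

Therefore, from (9.6.11), (9.6.13), and (9.6.15), we conclude that for all $n \ge
n_2(\lambda)$ and for all $\lambda \ge \lambda_1$

$$P\left[ \sup_{\beta \in B} \left| n^{-1} S_n(\beta) - n^{-1} S_{\lambda n}(\beta) \right| > \frac{\epsilon}{3} \right] < \frac{\delta}{2}. \tag{9.6.16}$$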
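Third, define

$$Q(\beta) = E F(x'\beta_0) + \int_{x'\beta < 0} [1 - 2F(x'\beta_0)] g(x)\, dx. \tag{9.6.17}$$

Then we have

$$\sup_{\beta \in B} |Q_\lambda(\beta) - Q(\beta)| \le \sup_{\beta \in B} P[(x'\beta)^2 < \lambda^{-2}]. \tag{9.6.18}$$

Therefore, using the same argument that led to (9.6.15), we conclude that for
all $\lambda \ge \lambda_1$

$$P\left[ \sup_{\beta \in B} |Q_\lambda(\beta) - Q(\beta)| > \frac{\epsilon}{3} \right] = 0. \tag{9.6.19}$$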


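Fourth, because

$$\sup_{\beta \in B} |n^{-1} S_n(\beta) - Q(\beta)| \le \sup_{\beta \in B} |n^{-1} S_n(\beta) - n^{-1} S_{\lambda n}(\beta)| + \sup_{\beta \in B} |n^{-1} S_{\lambda n}(\beta) - Q_\lambda(\beta)| + \sup_{\beta \in B} |Q_\lambda(\beta) - Q(\beta)|, \tag{9.6.20}$$

we conclude from (9.6.9), (9.6.16), and (9.6.19) that for any $\epsilon, \delta > 0$ we have
for all $n \ge \max[n_1(\lambda_1), n_2(\lambda_1)]$

$$P\left[ \sup_{\beta \in B} |n^{-1} S_n(\beta) - Q(\beta)| > \epsilon \right] < \delta. \tag{9.6.21}$$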




Fifth and finally, it remains to show that $Q(\beta)$ defined in (9.6.17) attains a
unique global maximum at $\beta_0$. This is equivalent to showing

$$\int_{x'\beta_0 < 0} [1 - 2F(x'\beta_0)] g(x)\, dx > \int_{x'\beta < 0} [1 - 2F(x'\beta_0)] g(x)\, dx \quad \text{if } \beta \neq \beta_0. \tag{9.6.22}$$

But, because $1 - 2F(x'\beta_0) > 0$ in the region $\{x \mid x'\beta_0 < 0\}$ and
$1 - 2F(x'\beta_0) < 0$ in the region $\{x \mid x'\beta_0 > 0\}$ by assumption A, (9.6.22) follows
immediately from assumption B.


9.6.2 Maximum Score Estimator — A Multinomial Case 


The multinomial QR model considered by Manski has the following
structure. The utility of the $i$th person when he or she chooses the $j$th alternative is
given by

$$U_{ij} = x_{ij}'\beta_0 + \epsilon_{ij}, \qquad i = 1, 2, \ldots, n, \quad j = 0, 1, \ldots, m, \tag{9.6.23}$$

where we assume

ASSUMPTION 9.6.1. $\{\epsilon_{ij}\}$ are i.i.d. for both $i$ and $j$.

ASSUMPTION 9.6.2. $(x_{i0}', x_{i1}', \ldots, x_{im}')' \equiv x_i$ is a sequence of $(m + 1)K$-dimensional
i.i.d. random vectors, distributed independently of $\{\epsilon_{ij}\}$, with a
joint density $g(x)$ such that $g(x) > 0$ for all $x$.

ASSUMPTION 9.6.3. The parameter space $B$ is defined by $B = \{\beta \mid \beta'\beta = 1\}$.


Each person chooses the alternative for which the utility is maximized.
Therefore, if we represent the event of the $i$th person choosing the $j$th
alternative by a binary random variable $y_{ij}$, we have

$$y_{ij} = \begin{cases} 1 & \text{if } U_{ij} > U_{ik} \text{ for all } k \neq j \\ 0 & \text{otherwise.} \end{cases} \tag{9.6.24}$$

We need not worry about the possibility of a tie because of Assumption 9.6.2.
The maximum score estimator $\hat\beta_n$ is defined as the value of $\beta$ that maximizes
the score

$$S_n(\beta) = \sum_{i=1}^{n} \sum_{j=0}^{m} y_{ij}\, \chi(x_{ij}'\beta \ge x_{ik}'\beta \text{ for all } k \neq j) \tag{9.6.25}$$

subject to $\beta'\beta = 1$, where $\chi$ is defined in (9.6.3). We shall indicate how to
generalize the consistency theorem to the present multinomial model.

As in the proof of Theorem 9.6.2, we must verify the assumptions of 
Theorem 9.6.1. Again, assumptions A and B are clearly satisfied. Assumption 
C can be verified in a manner very similar to the proof of Theorem 9.6.2. We 
merely note whatever changes are needed for the multinomial model. 

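We can generalize (9.6.7) as

$$S_{\lambda n}(\beta) = \sum_{i=1}^{n} \sum_{j=0}^{m} y_{ij}\, \psi_\lambda[(x_{ij} - x_{i0})'\beta, (x_{ij} - x_{i1})'\beta, \ldots, (x_{ij} - x_{i,j-1})'\beta, (x_{ij} - x_{i,j+1})'\beta, \ldots, (x_{ij} - x_{im})'\beta], \tag{9.6.26}$$

where

$$\psi_\lambda(z_1, z_2, \ldots, z_m) = \begin{cases} 0 & \text{if } \min_i (z_i) \le 0 \\ 1 & \text{if } \min_i (z_i) > \lambda^{-1} \\ \lambda \min_i (z_i) & \text{otherwise.} \end{cases} \tag{9.6.27}$$

Then the first four steps of the proof of Theorem 9.6.2 generalize
straightforwardly to the present case.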


The fifth step is similar but involves a somewhat different approach. From
(9.6.25) we obtain

$$Q(\beta) \equiv \operatorname{plim} n^{-1} S_n(\beta) = E \sum_{j=0}^{m} P(y_j = 1|x, \beta_0)\, \chi(x_j'\beta \ge x_k'\beta \text{ for all } k \neq j) \equiv E h(x, \beta), \tag{9.6.28}$$

where $\{y_j\}$ and $\{x_j\}$ are random variables from which i.i.d. observations $\{y_{ij}\}$
and $\{x_{ij}\}$ are drawn and $x = (x_0', x_1', \ldots, x_m')'$. First, we want to show
$h(x, \beta)$ is uniquely maximized at $\beta_0$. For this purpose consider maximizing

$$h^*(x, \{A_j\}) = \sum_{j=0}^{m} P(y_j = 1|x, \beta_0)\, \chi(A_j) \tag{9.6.29}$$

for every $x$ with respect to a nonoverlapping partition $\{A_j\}$ of the space of $x$.
This is equivalent to the following question: Given a particular $x_0$, to which
region should we assign this point? Suppose

$$P(y_{j_0} = 1|x_0, \beta_0) > P(y_j = 1|x_0, \beta_0) \quad \text{for all } j \neq j_0. \tag{9.6.30}$$

Then it is clearly best to assign $x_0$ to the region defined by

$$P(y_{j_0} = 1|x, \beta_0) > P(y_j = 1|x, \beta_0) \quad \text{for all } j \neq j_0. \tag{9.6.31}$$

Thus (9.6.29) is maximized by the partition $\{A_j^0\}$ defined by

$$A_j^0 = \{x \mid x_j'\beta_0 \ge x_k'\beta_0 \text{ for } k \neq j\}. \tag{9.6.32}$$

Clearly, this is also a solution to the more restricted problem of maximizing
$h(x, \beta)$. This maximum is unique because we always have a strict inequality in
(9.6.30) because of our assumptions. Also our assumptions are such that if
$h(x, \beta)$ is uniquely maximized for every $x$ at $\beta_0$, then $Eh(x, \beta)$ is uniquely
maximized at $\beta_0$. This completes the proof of the consistency of the maximum
score estimator in the multinomial case. (Figure 9.2 illustrates the
maximization of Eq. 9.6.29 for the case where m = 3 and x is a scalar.)

The asymptotic distribution of the maximum score estimator has not yet 
been obtained. A major difficulty lies in the fact that the score function is not 
differentiable and hence Theorem 4.1.3, which is based on a Taylor expansion 
of the derivative of a maximand, cannot be used. The degree of difficulty for 
the maximum score estimator seems greater than that for the LAD estimator 
discussed in Section 4.6 — the method of proving asymptotic normality for 
LAD does not work in the present case. In the binary case, maximizing (9.6.2)
is equivalent to minimizing $\sum_{i=1}^{n} |y_i - \chi(x_i'\beta \ge 0)|$. This shows both a
similarity of the maximum score estimator to the LAD estimator and the additional
difficulty brought about by the discontinuity of the $\chi$ function.

[Figure 9.2 An optimal partition of the space of an independent variable.
Curves shown: $P(y_0 = 1|x, \beta_0)$, $P(y_1 = 1|x, \beta_0)$, and $P(y_2 = 1|x, \beta_0)$, with the
horizontal axis divided into the regions of the partition.]
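
The equivalence between maximizing the binary score and minimizing a sum of absolute deviations can be checked in one line. Writing $c_i = \chi(x_i'\beta \ge 0)$ (a shorthand introduced here only for this remark), both $y_i$ and $c_i$ take the values 0 or 1, so that

$$y_i c_i + (1 - y_i)(1 - c_i) = 1 - |y_i - c_i|, \qquad \text{and hence} \qquad S_n(\beta) = n - \sum_{i=1}^{n} |y_i - \chi(x_i'\beta \ge 0)|.$$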

Manski (1975) reported results of a Monte Carlo study in which he com- 
pared the maximum score estimator with the logit MLE in a five-response 
multinomial logit model with four unknown parameters and 400 observa- 
tions. The study demonstrated a fairly good performance by the maximum 
score estimator as well as its computational feasibility. The bias of the estima- 
tor is very small, whereas the root mean squared error is somewhere in the 
magnitude of double that of MLE. A comparison of the maximum score 
estimator and MLE under a model different from the one from which MLE is 
derived would be interesting. 


9.6.3 Generalized Maximum Likelihood Estimator 


Cosslett (1983) proposed maximizing the log likelihood function (9.2.7) of a
binary QR model with respect to $\beta$ and $F$, subject to the condition that $F$ is a
distribution function. The log likelihood function, denoted here as $\psi$, is

$$\psi(\beta, F) = \sum_{i=1}^{n} \{ y_i \log F(x_i'\beta) + (1 - y_i) \log [1 - F(x_i'\beta)] \}. \tag{9.6.33}$$


The consistency proof of Kiefer and Wolfowitz (1956) applies to this kind of
model. Cosslett showed how to compute MLE $\hat\beta$ and $\hat F$ and derived conditions
for the consistency of MLE, translating the general conditions of Kiefer and 
Wolfowitz into this particular model. The conditions Cosslett found, which 
are not reproduced here, are quite reasonable and likely to hold in most 
practical applications. 

Clearly some kind of normalization is needed on $\beta$ and $F$ before we
maximize (9.6.33). Cosslett adopted the following normalization: The constant
term is 0 and the sum of squares of the remaining parameters is equal to 1.
Note that the assumption of zero constant term is adopted in lieu of Manski's
assumption $F(0) = 0.5$. We assume that the constant term has already been
eliminated from the $x_i'\beta$ that appears in (9.6.33). Thus we can proceed,
assuming $\beta'\beta = 1$.

The maximization of (9.6.33) is carried out in two stages. In the first stage 
we shall fix $\beta$ and maximize $\psi(\beta, F)$ with respect to $F$. Let the solution be
$\hat F(\beta)$. Then in the second stage we shall maximize $\psi[\beta, \hat F(\beta)]$ with respect to $\beta$.
Although the second stage presents a more difficult computational problem, 
we shall describe only the first stage because it is nonstandard and concep- 
tually more difficult. 
The first-stage maximization consists of several steps:

Step 1. Given $\beta$, rank order $\{x_i'\beta\}$. Suppose $x_{(1)}'\beta < x_{(2)}'\beta < \ldots < x_{(n)}'\beta$,
assuming there is no tie. Determine a sequence $(y_{(1)}, y_{(2)}, \ldots, y_{(n)})$
accordingly. Note that this is a sequence consisting only of ones and zeros.

Step 2. Partition this sequence into the smallest possible number of
successive groups in such a way that each group consists of a nonincreasing
sequence.

Step 3. Calculate the ratio of ones over the number of elements in each
group. Let a sequence of ratios thus obtained be $(r_1, r_2, \ldots, r_K)$, assuming
there are $K$ groups. If this is a nondecreasing sequence, we are done. We define
$\hat F(x_{(i)}'\beta) = r_j$ if the $(i)$th observation is in the $j$th group.

Step 4. If, however, $r_j < r_{j-1}$ for some $j$, combine the $j$th and $(j - 1)$th group
and repeat step 3 until we obtain a nondecreasing sequence.


The preceding procedure can best be taught by example:


Example 1.

    y_(i)           | 0  0  0 | 1  1  0 | 1  1
    F(x_(i)'β)      |    0    |   2/3   |   1

In this example, there is no need for step 4.

Example 2.

    y_(i)           | 0  0  0 | 1  1  0 | 1  0 | 1  1
    F(x_(i)'β)      |    0    |   2/3   | 1/2  |   1

Here, the second and third group must be combined to yield

    y_(i)           | 0  0  0 | 1  1  0  1  0 | 1  1
    F(x_(i)'β)      |    0    |      3/5      |   1


Note that $\hat F$ is not unique over some parts of the domain. For example,
between $x_{(3)}'\beta$ and $x_{(4)}'\beta$ in Example 1, $\hat F$ may take any value between 0 and 2/3.
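
The first stage can be sketched in a few lines of code. The version below is an assumption-laden illustration rather than Cosslett's own procedure: it uses the standard pool-adjacent-violators form (merge a block into its left neighbor whenever its ratio of ones is smaller), which is a different bookkeeping from steps 2 through 4 but reproduces the fitted values of both examples above. The function name `first_stage_F` is hypothetical, and the input is assumed to be already ordered by $x_i'\beta$ with no ties.

```python
from fractions import Fraction

def first_stage_F(y_sorted):
    """Fitted values of F at each ordered observation, for 0/1 outcomes
    that are already sorted by x'beta (pool-adjacent-violators sketch)."""
    # Each block stores [number of ones, number of observations].
    blocks = []
    for y in y_sorted:
        blocks.append([y, 1])
        # Merge the last block into its left neighbor while its ratio is smaller.
        while len(blocks) > 1 and (
            Fraction(blocks[-1][0], blocks[-1][1])
            < Fraction(blocks[-2][0], blocks[-2][1])
        ):
            ones, size = blocks.pop()
            blocks[-1][0] += ones
            blocks[-1][1] += size
    fitted = []
    for ones, size in blocks:
        fitted.extend([Fraction(ones, size)] * size)
    return fitted

example_1 = first_stage_F([0, 0, 0, 1, 1, 0, 1, 1])
example_2 = first_stage_F([0, 0, 0, 1, 1, 0, 1, 0, 1, 1])
```

Exact fractions are used so that the output can be compared directly with the ratios in the examples; floating-point ratios would serve equally well in applied work.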

Asymptotic normality has not been proved for Cosslett's MLE $\hat\beta$, nor for
any model to which the consistency proof of Kiefer and Wolfowitz is applica- 
ble. This seems to be as difficult a problem as proving the asymptotic normal- 
ity of Manski's maximum score estimator. 

Cosslett's MLE may be regarded as a generalization of Manski's estimator 
because the latter searches only among one-jump step functions to maximize 
(9.6.33). However, this does not necessarily imply that Cosslett's estimator is 
superior. Ranking of estimators can be done according to various criteria. If
the purpose is prediction, Manski’s estimator is an attractive one because it 
maximizes the number of correct predictions. 


9.7 Panel Data QR Models 


Panel data consist of observations made on individuals over time. We shall 
consider models in which these observations are discrete. If observations are 
independent both over individuals and over time, a panel data QR model does 
not pose any new problem. It is just another QR model, possibly with more 
observations than usual. Special problems arise only when observations are 
assumed to be correlated over time. Serial correlation, or temporal correla- 
tion, is certainly more common than correlation among observations of dif- 
ferent individuals or cross-section units. In this section we shall consider 
various panel data QR models with various ways of specifying serial correla- 
tion and shall study the problem of estimation in these models. 


9.7.1 Models with Heterogeneity 


We shall study the problem of specifying a panel data QR model by consider- 
ing a concrete example of sequential labor force participation by married 
women, following the analysis of Heckman and Willis (1977). As is customary 
with a study of labor force participation, it is postulated that a person works if 
the offered wage exceeds the reservation wage (shadow price of time). The 
offered wage depends on the person’s characteristics such as her education, 
training, intelligence, and motivation, as well as on local labor market condi- 
tions. The reservation wage depends on the person’s assets, her husband’s 
wage rate, prices of goods, interest rates, expected value of future wages and 
prices, and the number and age of her children. Let $y_{it} = 1$ if the $i$th person
works at time $t$ and $= 0$ otherwise and let $x_{it}$ be a vector of the observed
independent variables (a subset of the variables listed above). Then for fixed $t$
it seems reasonable to specify a binary QR model

$$P(y_{it} = 1) = F(x_{it}'\beta), \qquad i = 1, 2, \ldots, n, \tag{9.7.1}$$

where $F$ is an appropriate distribution function.

However, with panel data, (9.7.1) is not enough because it specifies only a
marginal probability for fixed $t$ and leaves the joint probability
$P(y_{i1}, y_{i2}, \ldots, y_{iT})$ unspecified. (We assume $\{y_{it}\}$ are independent over $i$.)
The simplest way to specify a joint probability is to assume independence and
specify $P(y_{i1}, y_{i2}, \ldots, y_{iT}) = \prod_{t=1}^{T} P(y_{it})$. Then we obtain the binary QR
model studied in Section 9.2, the only difference being that we have $nT$
observations here.

The independence assumption would imply $P(y_{it} = 1 \mid y_{i,t-1} = 1) =
P(y_{it} = 1)$. In other words, once we observe $x_{it}$, whether or not a woman
worked in the previous year would offer no new information concerning her
status this year. Surely, such a hypothesis could not be supported empirically.

There are two major reasons why we would expect $P(y_{it} = 1 \mid y_{i,t-1} = 1) \neq
P(y_{it} = 1)$:


1. Heterogeneity. There are unobservable variables that affect people
differently in regard to their tendency to work.

2. True State Dependence. For each person, present status influences future 
status. For example, once you start doing something, it becomes habitual. 


Heckman and Willis (1977) attempted to take account of only heterogeneity 
in their study. We shall consider how to model true state dependence in 
Section 9.7.2. 

A straightforward approach to take account of heterogeneity is to specify

$$P(y_{it} = 1 \mid u_i) = F(x_{it}'\beta + u_i), \qquad i = 1, 2, \ldots, n, \quad t = 1, 2, \ldots, T, \tag{9.7.2}$$

and assume $\{y_{it}\}$ are serially independent (that is, over $t$) conditional on $u_i$.
Therefore we have (suppressing subscript $i$)

$$P(y_t = 1 \mid y_{t-1} = 1) - P(y_t = 1) = \frac{E[F(x_t'\beta + u) F(x_{t-1}'\beta + u)]}{E F(x_{t-1}'\beta + u)} - E F(x_t'\beta + u) = \frac{\operatorname{Cov}[F(x_t'\beta + u), F(x_{t-1}'\beta + u)]}{E F(x_{t-1}'\beta + u)}, \tag{9.7.3}$$

which can be shown to be nonnegative. The joint probability of $\{y_{it}\}$, $t =
1, 2, \ldots, T$, is given by

$$P(y_{i1}, y_{i2}, \ldots, y_{iT}) = E\left\{ \prod_{t=1}^{T} F(x_{it}'\beta + u_i)^{y_{it}} [1 - F(x_{it}'\beta + u_i)]^{1 - y_{it}} \right\}. \tag{9.7.4}$$

The likelihood function of the model is the product of (9.7.4) over all
individuals $i = 1, 2, \ldots, n$. It is assumed that $\{u_i\}$ are i.i.d. over individuals.
The expectation occurring in (9.7.4) is computationally cumbersome in
practice, except for the case where $F = \Phi$, the standard normal distribution
function, and $u_i$ is also normal. In this case, Butler and Moffitt (1982) pointed
out that using the method of Gaussian quadrature, an evaluation of (9.7.4) for
a model with, say, $n = 1500$ and $T = 10$ is within the capacity of most
researchers having access to a high-speed computer. We shall come back to a
generalization of model (9.7.4) in Section 9.7.2. But here we shall discuss a 
computationally simpler model in which (9.7.4) is expressed as a product and 
ratio of gamma functions. 
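
To make the quadrature idea concrete, here is a minimal sketch of how the expectation in (9.7.4) might be approximated when $F$ is the standard normal distribution function and $u_i \sim N(0, \sigma^2)$. It is not Butler and Moffitt's code. The five-point Gauss-Hermite rule, the function names `norm_cdf` and `re_probit_prob`, and the argument `index` (standing for the values $x_t'\beta$) are assumptions for illustration; a serious application would use more nodes.

```python
import math

# Five-point Gauss-Hermite nodes and weights for the weight function exp(-z^2).
NODES = (-2.020182870456086, -0.958572464613819, 0.0,
         0.958572464613819, 2.020182870456086)
WEIGHTS = (0.019953242059046, 0.393619323152241, 0.945308720482942,
           0.393619323152241, 0.019953242059046)

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def re_probit_prob(index, y, sigma):
    """Approximate the joint probability (9.7.4) for one individual.

    index[t] plays the role of x_t'beta, y[t] is 0 or 1, and u ~ N(0, sigma^2).
    The substitution u = sqrt(2) * sigma * z turns E g(u) into
    pi^(-1/2) * sum_k w_k g(sqrt(2) * sigma * z_k).
    """
    total = 0.0
    for z, w in zip(NODES, WEIGHTS):
        u = math.sqrt(2.0) * sigma * z
        prod = 1.0
        for a, y_t in zip(index, y):
            # Phi(a + u) if y_t = 1, and 1 - Phi(a + u) = Phi(-(a + u)) if y_t = 0.
            prod *= norm_cdf((2 * y_t - 1) * (a + u))
        total += w * prod
    return total / math.sqrt(math.pi)

p_10 = re_probit_prob([0.3, -0.5], [1, 0], sigma=1.0)
```

The likelihood is then the product of such terms over individuals, so each evaluation costs only (number of nodes) times $T$ univariate normal probabilities per person.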

Heckman and Willis proposed what they called a beta-logistic model,
defined as

$$P(y_{it} = 1 \mid u_i) = u_i, \qquad i = 1, 2, \ldots, n, \quad t = 1, 2, \ldots, T, \tag{9.7.5}$$

where $u_i$ is distributed according to the beta density

$$f_i(u_i) = \frac{\Gamma(a_i + b_i)}{\Gamma(a_i)\Gamma(b_i)}\, u_i^{a_i - 1} (1 - u_i)^{b_i - 1}, \qquad 0 \le u_i \le 1, \quad a_i > 0, \quad b_i > 0, \tag{9.7.6}$$

where $\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\, dx$. It is assumed that $\{y_{it}\}$ are serially independent
conditional on $u_i$. (Independence over individuals is always assumed.)

The beta density has mean $(a + b)^{-1} a$ and variance $(a + b)^{-2}(a +
b + 1)^{-1} ab$ and can take various shapes: a bell shape if $a, b > 1$; monotonically
increasing if $a > 1$ and $b < 1$; monotonically decreasing if $a < 1$ and $b > 1$;
uniform if $a = b = 1$; and U shaped if $a, b < 1$.

We have (suppressing subscript $i$)

$$P(y_t = 1 \mid y_{t-1} = 1) = \frac{P(y_t = 1, y_{t-1} = 1)}{P(y_{t-1} = 1)} = \frac{E u^2}{E u} > E u = P(y_t = 1), \tag{9.7.7}$$

where the inequality follows from $Vu > 0$.

Heckman and Willis postulated $a_i = \exp(x_i'\alpha)$ and $b_i = \exp(x_i'\beta)$, where $x_i$
is a vector of time-independent characteristics of the $i$th individual. Thus,
from (9.7.5), we have

$$P(y_{it} = 1) = \Lambda[x_i'(\alpha - \beta)]. \tag{9.7.8}$$
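
The step leading to (9.7.8) can be spelled out using the mean of the beta density, $a_i/(a_i + b_i)$. With $\Lambda(z) = e^z/(1 + e^z)$ denoting the logistic distribution function,

$$P(y_{it} = 1) = E u_i = \frac{a_i}{a_i + b_i} = \frac{\exp(x_i'\alpha)}{\exp(x_i'\alpha) + \exp(x_i'\beta)} = \frac{1}{1 + \exp[-x_i'(\alpha - \beta)]} = \Lambda[x_i'(\alpha - \beta)].$$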


Equation (9.7.8) shows that if we consider only marginal probabilities we have
a logit model; in this sense a beta-logistic model is a generalization of a logit
model. By maximizing the logit likelihood
$\prod_{i=1}^{n} \prod_{t=1}^{T} \Lambda[x_i'(\alpha - \beta)]^{y_{it}} \{1 - \Lambda[x_i'(\alpha - \beta)]\}^{1 - y_{it}}$ we can obtain a consistent
estimate of $\alpha - \beta$. However, we can estimate each of $\alpha$ and $\beta$ consistently and also
more efficiently by maximizing the full likelihood function. If the $i$th person
worked $s_i$ periods out of $T$, $i = 1, 2, \ldots, n$, the likelihood function of the
beta-logistic model is given by

$$L = \prod_{i=1}^{n} E[u_i^{s_i} (1 - u_i)^{T - s_i}] = \prod_{i=1}^{n} \frac{\Gamma(a_i + b_i)}{\Gamma(a_i)\Gamma(b_i)} \cdot \frac{\Gamma(a_i + s_i)\Gamma(b_i + T - s_i)}{\Gamma(a_i + b_i + T)}. \tag{9.7.9}$$

An undesirable feature of the beta-logistic model (9.7.9) is that the
independent variables $x_i$ are time independent. Thus, in their empirical study of 1583
women observed over a five-year period, Heckman and Willis were forced to
use the values of the independent variables at the first period as $x_i$ even though
some of the values changed with $t$.
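
Because (9.7.9) involves only gamma functions, its logarithm is cheap to compute. The sketch below is one possible way to do so, assuming the parametrization $a_i = \exp(x_i'\alpha)$, $b_i = \exp(x_i'\beta)$ given above; the function name `beta_logistic_loglik` and the data layout are hypothetical choices made for this illustration.

```python
import math

def beta_logistic_loglik(alpha, beta, xs, s, T):
    """Logarithm of the likelihood (9.7.9).

    xs[i] is the regressor tuple of person i, s[i] the number of periods
    worked out of T, with a_i = exp(x_i'alpha) and b_i = exp(x_i'beta).
    """
    loglik = 0.0
    for x, s_i in zip(xs, s):
        a = math.exp(sum(c * v for c, v in zip(alpha, x)))
        b = math.exp(sum(c * v for c, v in zip(beta, x)))
        # lgamma avoids overflow in the gamma functions for large arguments.
        loglik += (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                   + math.lgamma(a + s_i) + math.lgamma(b + T - s_i)
                   - math.lgamma(a + b + T))
    return loglik

# With a = b = 1 (uniform u), one person who worked 1 period out of 2:
# E[u(1 - u)] = 1/6.
check = beta_logistic_loglik((0.0,), (0.0,), [(1.0,)], [1], 2)
```

The resulting function of $(\alpha, \beta)$ can be handed to any general-purpose numerical optimizer.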


9.7.2 Models with Heterogeneity and True State Dependence 


In this subsection we shall develop a generalization of model (9.7.2) that can 
incorporate both heterogeneity and true state dependence. 

Following Heckman (1981a), we assume that there is an unobservable
continuous variable $y_{it}^*$ that determines the outcome of a binary variable $y_{it}$ by
the rule

$$y_{it} = \begin{cases} 1 & \text{if } y_{it}^* > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{9.7.10}$$

In a very general model, $y_{it}^*$ would depend on independent variables, lagged
values of $y_{it}^*$, lagged values of $y_{it}$, and an error term that can be variously
specified. We shall analyze a model that is simple enough to be computationally
feasible and yet general enough to contain most of the interesting features
of this type of model: Let

$$y_{it}^* = x_{it}'\beta + \gamma y_{i,t-1} + v_{it} \equiv \bar{y}_{it} + v_{it}, \tag{9.7.11}$$

where for each $i$, $v_{it}$ is serially correlated in general. We define true state
dependence to mean $\gamma \neq 0$ and heterogeneity to mean serial correlation of $\{v_{it}\}$.

Model (9.7.2) results from (9.7.11) if we put $\gamma = 0$ and $v_{it} = u_i + \epsilon_{it}$, where
$\{\epsilon_{it}\}$ are serially independent. Thus we see that model (9.7.2) is restrictive not
only because it assumes no true state dependence but also because it assumes a
special form of heterogeneity. This form of heterogeneity belongs to a
so-called one-factor model and will be studied further in Section 9.7.3.

Here we shall assume that for each $i$, $\{v_{it}\}$ are serially correlated in a general
way and derive the likelihood function of the model. Because $\{y_{it}\}$ are
assumed to be independent over $i$, the likelihood function is the product of
individual likelihood functions. Therefore we suppress $i$ and consider the
likelihood function of a typical individual.

Define $T$-vectors $y$, $\bar{y}$, and $v$ the $t$th elements of which are $y_t$, $\bar{y}_t$, and $v_t$,
respectively, and assume $v \sim N(0, \Sigma)$. Then the joint probability of $y$ (hence, a
typical individual likelihood function) conditional on $y_0$ can be concisely
expressed as

$$P(y) = F[\bar{y} * (2y - l);\; \Sigma * (2y - l)(2y - l)'], \tag{9.7.12}$$

where $*$ denotes the Hadamard product (see Theorem 23 of Appendix 1), $l$ is a
$T$-vector of ones, and $F(x; \Sigma)$ denotes the distribution function of $N(0, \Sigma)$
evaluated at $x$. Note that in deriving (9.7.12) we assumed that the conditional
distribution of $v$ given $y_0$ is $N(0, \Sigma)$. This, however, may not be a reasonable
assumption.

As with continuous variable panel data discussed in Section 6.6.3, the
specification of the initial conditions $y_{i0}$ (reintroducing subscript $i$) is an
important problem in model (9.7.11), where $T$ is typically small and $N$ is large.
The initial conditions $y_{i0}$ can be treated in two ways. (We do assume they are
observable. Treating them as $N$ unknown parameters is not a good idea when
$N$ is large and $T$ is small.)

1. Treat $y_{i0}$ as known constants, 1 or 0. Then the likelihood function is as
given in (9.7.12).

2. Assume that $y_{i0}$ is a random variable with the probability distribution
$P(y_{i0} = 1) = \Phi(x_{i0}'\alpha)$, where $\alpha$ is a vector of unknown parameters.


9.7.3 One-Factor Models 


The individual likelihood function (9.7.12) involves a $T$-tuple normal
integral, and therefore its estimation is computationally infeasible for large $T$ (say,
greater than 5). For this reason we shall consider in this subsection one-factor
models that lead to a simplification of the likelihood function.

We assume

$$v_{it} = \alpha_t u_i + \epsilon_{it}, \tag{9.7.13}$$

where $\{\alpha_t\}$, $t = 1, 2, \ldots, T$, are unknown parameters, $u_i$ and $\{\epsilon_{it}\}$ are
normally distributed independent of each other, and $\{\epsilon_{it}\}$ are serially
independent. We suppress subscript $i$ as before and express (9.7.13) in obvious vector
notation as

$$v = \alpha u + \epsilon, \tag{9.7.14}$$

where $v$, $\alpha$, and $\epsilon$ are $T$-vectors and $u$ is a scalar. Then the joint probability of $y$
can be written as

$$P(y) = E_u F[\bar{y} * (2y - l);\; D * (2y - l)(2y - l)'], \tag{9.7.15}$$

where $\bar{y}$ now includes $\alpha u$ and $D = E\epsilon\epsilon'$. Because $D$ is a diagonal matrix, $F$ in
(9.7.15) can be factored as the product of $T$ normal distribution functions. The
estimation of this model, therefore, is no more difficult than the estimation of
model (9.7.4).

For the case $T = 3$, model (9.7.14) contains a stationary first-order
autoregressive model (see Section 5.2.1) as a special case. To see this, put
$\alpha = (1 - \rho^2)^{-1/2}(\rho, 1, \rho)'$, $Vu = \sigma^2$, and take the
diagonal elements of $D$ as $\sigma^2$, 0, and $\sigma^2$. Thus, if $T = 3$, the
hypothesis of AR(1) can easily be tested within the more general model
(9.7.14). Heckman (1981c) accepted the AR(1) hypothesis using the same data for
female labor participation as used by Heckman and Willis (1977). If $T > 4$,
model (9.7.13) can be stationary if and only if $\alpha_t$ is constant for all
$t$. A verification of this is left as an exercise.
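
The $T = 3$ claim can be checked numerically. The sketch below assumes the AR(1) process is $v_t = \rho v_{t-1} + e_t$ with innovation variance $\sigma^2$, so that its stationary covariance matrix has $(s, t)$ element $\sigma^2\rho^{|s-t|}/(1 - \rho^2)$, and compares it with $\alpha\alpha' Vu + D$ from (9.7.14). The helper names and the values of $\rho$ and $\sigma^2$ are illustrative only.

```python
import numpy as np


def one_factor_cov(alpha, var_u, d_diag):
    """Covariance of v = alpha*u + eps: var_u * alpha alpha' + diag(d_diag)."""
    alpha = np.asarray(alpha, dtype=float)
    return var_u * np.outer(alpha, alpha) + np.diag(d_diag)


def ar1_cov(rho, sigma2, T):
    """Stationary AR(1) covariance: sigma2 / (1 - rho^2) * rho^|s - t|."""
    idx = np.arange(T)
    return sigma2 / (1.0 - rho**2) * rho ** np.abs(idx[:, None] - idx[None, :])


rho, sigma2 = 0.6, 2.0
alpha = np.array([rho, 1.0, rho]) / np.sqrt(1.0 - rho**2)
V_factor = one_factor_cov(alpha, sigma2, [sigma2, 0.0, sigma2])
V_ar1 = ar1_cov(rho, sigma2, 3)
print(np.allclose(V_factor, V_ar1))
```

The two matrices agree element by element: the middle period carries no idiosyncratic variance, and the outer periods pick up exactly the variance needed to restore stationarity.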

Consider a further simplification of (9.7.13) obtained by assuming
$\alpha_t = 1$, $\{u_i\}$ are i.i.d. over $i$, and $\{\epsilon_{it}\}$ are i.i.d.
both over $i$ and $t$. This model differs from model (9.7.2) only in the presence
of $y_{i,t-1}$ among the right-hand variables and is analogous to the
Balestra-Nerlove model (Section 6.6.3) in the continuous variable case.

As in the Balestra-Nerlove model, $\{u_i\}$ may be regarded as unknown
parameters to estimate. If both $N$ and $T$ go to $\infty$, $\beta$, $\gamma$, and
$\{u_i\}$ can be consistently estimated. An interesting question is whether we
can estimate $\beta$ and $\gamma$ consistently when only $N$ goes to $\infty$.
Unlike the Balestra-Nerlove model, the answer to this question is generally
negative for the model considered in this subsection. In a probit model, for
example, the values of $\beta$ and $\gamma$ that maximize


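$L = \prod_{i=1}^{n} \prod_{t=1}^{T} \Phi(x_{it}'\beta + \gamma y_{i,t-1} + u_i)^{y_{it}} [1 - \Phi(x_{it}'\beta + \gamma y_{i,t-1} + u_i)]^{1-y_{it}},$    (9.7.16)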


while treating $\{u_i\}$ as unknown constants, are not consistent. Heckman
(1981b), in a small-scale Monte Carlo study for a probit model with $n = 100$
and $T = 8$, compared this estimator (with $y_{i0}$ treated as given constants),
called the transformation estimator, to the random-effect probit MLE, in
which we maximize the expectation of (9.7.16) taken with respect to $\{u_i\}$,
regarded as random variables, and specify the probability of $y_{i0}$ under the
specification 2 given at the end of Section 9.7.2.

Heckman concluded that (1) if $\gamma = 0$, the transformation estimator
performed fairly well relative to the random-effect probit MLE; (2) if
$\gamma \neq 0$, the random-effect probit MLE was better than the transformation
estimator, the latter exhibiting a downward bias in $\gamma$ as in the
Balestra-Nerlove model (see Section 6.6.3).


Exercises 


1. (Section 9.2.1) 
In a Monte Carlo study, Goldfeld and Quandt (1972, Chapter 4) generated
$\{y_i\}$ according to the model $P(y_i = 1) = \Phi(0.2 + 0.5x_{1i} + 2x_{2i})$
and, using the generated $\{y_i\}$ and given $\{x_{1i}, x_{2i}\}$, estimated the
$\beta$'s in the linear probability model
$P(y_i = 1) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}$. Their estimates were
$\hat\beta_0 = 0.58$, $\hat\beta_1 = 0.1742$, and $\hat\beta_2 = 0.7451$. How do
you convert these estimates into the estimates of the coefficients in the
probit model?


2. (Section 9.2.2) 
Consider a logit model $P(y_i = 1) = \Lambda(\beta_0 + \beta_1 x_i)$, where $x_i$
is a binary variable taking values 0 and 1. This model can be also written as a
linear probability model $P(y_i = 1) = \gamma_0 + \gamma_1 x_i$.
a. Determine $\gamma_0$ and $\gamma_1$ as functions of $\beta_0$ and $\beta_1$.
b. Show that the MLE of $\gamma_0$ and $\gamma_1$ are equal to the least squares
estimates in the regression of $y_i$ on $x_i$ with an intercept.


3. (Section 9.2.3) 
Show that global concavity is not invariant to a one-to-one transforma- 
tion of the parameter space. 


4. (Section 9.2.8) 
In the model of Exercise 2, we are given the following data: 


x    1  1  1  0  0  0  0  0  1  0
y    0  0  1  0  0  1  1  0  1  0


Calculate the MLE and the DA estimates (with $\Sigma_0 = \Sigma_1$) of $\beta_0$
and $\beta_1$.




5. (Section 9.2.8) 
The following data come from a hypothetical durable goods purchase 
study: 


Case (t)  Constant  $x_t$  $n_t$  $r_t$  $r_t/n_t$  $\Phi^{-1}(r_t/n_t)$  $\phi[\Phi^{-1}(r_t/n_t)]$
1         1         5      25     12     0.4800     -0.0500               0.3984
2         1         7      26     16     0.6154      0.2930               0.3822
3         1         10     31     22     0.7097      0.5521               0.3426
4         1         15     27     21     0.7778      0.7645               0.2979


a. Compute the coefficient estimates $\hat\beta_0$ and $\hat\beta_1$ using the
following models and estimators:
(1) Linear Probability Model, Least Squares
(2) Linear Probability Model, Weighted Least Squares
(3) Logit Minimum $\chi^2$
(4) Probit Minimum $\chi^2$
(5) Discriminant Analysis Estimator
b. For all estimators except (5), find the asymptotic variance-covariance
matrix (evaluated at the respective coefficient estimates).
c. For all estimators, obtain estimates of the probabilities $P_t$
corresponding to each case $t$.
d. Rank the estimators according to each of the following criteria: 


(0 X6 -y. 


A m 
2 + tC E> 
(2) LEB) By 


(3) $\log L = \sum_t [r_t \log \hat P_t + (n_t - r_t) \log (1 - \hat P_t)]$.


6. (Section 9.2.9) 
It may be argued that in (9.2.62) the asymptotic variance of $r$ should
be the unconditional variance of $r$, which is
$n^{-2}E\sum_{i=1}^{n} F_i(1 - F_i) + V(n^{-1}\sum_{i=1}^{n} F_i)$, where $V$ is
taken with respect to random variables $\{x_i\}$.
What is the fallacy of this argument?


7. (Section 9.3.3) 
In the multinomial logit model (9.3.34), assume $j = 0$, 1, and 2. For this
model define the NLGLS iteration.


8. (Section 9.3.5)
Suppose $\{y_i\}$, $i = 1, 2, \ldots, n$, are independent random variables
taking three values, 0, 1, and 2, according to the probability distribution
defined by (9.3.51) and (9.3.52), where we assume $\mu_0 = 0$,
$\mu_1 = x_i'\beta_1$, and $\mu_2 = x_i'\beta_2$. Indicate how we can
consistently estimate $\beta_1$, $\beta_2$, and $\rho$ using only a binary logit
program.


9. (Section 9.3.5)
Write down (9.3.59) and (9.3.60) in the special case where $S = 2$,
$B_1 = (1, 2)$, and $B_2 = (3, 4)$ and show for which values of the parameters
the model is reduced to a four-response independent logit model.


10. (Section 9.3.5)
You are to analyze the decision of high school graduates as to whether or
not they go to college and, if they go to college, which college they go to.
For simplicity assume that each student considers only two possible colleges
to go to. Suppose that for each person $i$, $i = 1, 2, \ldots, n$, we observe
$z_i$ (family income and levels of parents' education) and $x_{ij}$ (the quality
index and the cost of the $j$th school), $j = 1$ and 2. Also suppose that we
observe for every person in the sample whether or not he or she went to
college but observe a college choice only for those who went to college.
Under these assumptions, define your own model and show how to estimate the
parameters of the model (cf. Radner and Miller, 1970).


11. (Section 9.3.6)
Write down (9.3.64), (9.3.65), and (9.3.66) in the special case of Figure 9.1
(three-level), that is, $C_1 = (1, 2)$, $C_2 = (3, 4)$, $B_1 = (1, 2)$,
$B_2 = (3, 4)$, $B_3 = (5, 6)$, and $B_4 = (7, 8)$.


12. (Section 9.3.10)
Suppose that $y_i$ takes values 0, 1, and 2 with the probability distribution
$P(y_i = 0) = \Lambda(x_i'\beta_0)$ and
$P(y_i = 1 \mid y_i \neq 0) = \Lambda(x_i'\beta_1)$. Assuming that we have $n_t$
independent observations on $y_t$ with the same value $x_t$ of the independent
variables, $t = 1, 2, \ldots, T$, indicate how to calculate the MIN $\chi^2$
estimates of $\beta_0$ and $\beta_1$.


13. (Section 9.4.3)
Consider two jointly distributed discrete random variables $y$ and $x$ such
that $y$ takes two values, 0 and 1, and $x$ takes three values, 0, 1, and 2. The
most general model (called the saturated model) is the model in which
there is no constraint among the five probabilities that characterize the
joint distribution. Consider a specific model (called the null hypothesis
model) in which

$P(y = 1 \mid x) = [1 + \exp(-\alpha - \beta x)]^{-1}$

and the marginal distribution of $x$ is unconstrained. Given $n$ independent
observations on $(y, x)$, show how to test the null hypothesis against the
saturated model. Write down explicitly the test statistic and the critical
region you propose.


14. (Section 9.4.3)
Suppose the joint distribution of $y_{1t}$ and $y_{2t}$, $t = 1, 2, \ldots, T$,
is given by the following table:

                  $y_{2t} = 1$                                   $y_{2t} = 0$
$y_{1t} = 1$      $d_t^{-1} \exp(\alpha'x_t + \beta'z_t)$        $d_t^{-1} \exp(\alpha'x_t)$
$y_{1t} = 0$      $d_t^{-1}$                                     $d_t^{-1}$

where $d_t = \exp(\alpha'x_t + \beta'z_t) + \exp(\alpha'x_t) + 2$. Given $n_t$
independent observations on $(y_{1t}, y_{2t})$, define the minimum chi-square
estimates of the vectors $\alpha$ and $\beta$. Assume that $x_t$ and $z_t$ are
vectors of known constants.


15. (Section 9.4.3)
Let $y_j$, $j = 1, 2, \ldots, J$, be discrete random variables taking $N_j$
values. Suppose the conditional probability
$P(y_j \mid y_1, \ldots, y_{j-1}, y_{j+1}, \ldots, y_J)$ is given for every $j$
and is positive. Then prove that there is at most one set of joint
probabilities consistent with the given conditional probabilities
(cf. Amemiya, 1975).


16. (Section 9.4.3)
Let $y_1$ and $y_2$ be binary variables taking the value 1 or 0. Show that if
$P(y_1 = 1 \mid y_2) = \Lambda(x'\beta_1 + \beta_{12} y_2)$ and
$P(y_2 = 1 \mid y_1) = \Lambda(x'\beta_2 + \beta_{21} y_1)$, then
$\beta_{12} = \beta_{21}$.


17. (Section 9.5.2)

In the simple logit model defined in the paragraph after (9.5.22), the 
efficiency ofthe WMLE of fi, using the optimal /* depends only on p and 
Qo. Prove Eff[h*(p, Qo), p, Qo] = Eff[h*(p, 1— Qo), P, 1— Qol. 


18. (Section 9.5.2)
In the same model described in Exercise 17, show that if $Q_0 = 0.5$,
$h^* = 0.5$.


19. (Section 9.5.3)
In the same model described in Exercise 17, show that the asymptotic 
variance of the MME of fi, is the same as that of the WMLE. 


20. (Section 9.5.3)
Show that (9.5.36) follows from (9.5.19).


21. (Section 9.6.3)
We are given the following data:

$i$          1    2    3    4    5
$y_i$        1    0    0    1    1
$x_{1i}$    -1   -1    0    0    1
$x_{2i}$     0    1   -1    1    0

a. Obtain the set of $\beta$ values that maximize

$S(\beta) = \sum_{i=1}^{5} [y_i \chi(\beta x_{1i} + x_{2i} \ge 0) + (1 - y_i)\chi(\beta x_{1i} + x_{2i} < 0)],$

where $\chi(E) = 1$ if event $E$ occurs, or is zero otherwise.
b. Obtain the set of $\beta$ values that maximize

$\psi(\beta, F) = \sum_{i=1}^{5} \{y_i \log F(\beta x_{1i} + x_{2i}) + (1 - y_i) \log [1 - F(\beta x_{1i} + x_{2i})]\},$

where $F$ is also chosen to maximize $\psi$ among all the possible distribution
functions. (Note that I have adopted my own normalization, which may
differ from Manski's or Cosslett's, so that the parameter space of $\beta$ is the
whole real line.)


22. (Section 9.6.3)
Cosslett (1983, p. 780) considered two sequences of $y_i$ ordered according
to the magnitude of $x_i'\beta$:

Sequence A:  1  0  1  0  1  1  1  1
Sequence B:  0  1  1  1  1  1  1  0

He showed that Sequence A yields a higher value of log L whereas Sequence B
yields a higher score. Construct two sequences with nine observations each
and an equal number of ones such that one sequence yields a higher value of
log L and the other sequence yields a higher score. Your sequence should not
contain any of Cosslett's sequences as a subset.


23. (Section 9.7.2)
Assume that $\beta = 0$ and $\{v_{it}\}$ are i.i.d. across i in model (9.7.11)
and derive the stationary probability distribution of $\{y_{it}\}$ for a
particular $i$.


24. (Section 9.7.3) 
Show that if $T > 4$, model (9.7.13) can be stationary if and only if
$\alpha_t$ is constant for all $t$. Show that if $T = 4$ and $\alpha_t > 0$, then
model (9.7.13) can be stationary if and only if $\alpha_t$ is constant for all
$t$.


10 Tobit Models 


10.1 Introduction 


Tobit models refer to censored or truncated regression models in which the 
range of the dependent variable is constrained in some way. In economics, 
such a model was first suggested in a pioneering work by Tobin (1958). He 
analyzed household expenditure on durable goods using a regression model 
that specifically took account of the fact that the expenditure (the dependent
variable of his regression model) cannot be negative. Tobin called his model 
the model of limited dependent variables. It and its various generalizations are 
known popularly among economists as Tobit models, a phrase coined by 
Goldberger (1964), because of similarities to probit models. These models are 
also known as censored or truncated regression models. The model is called 
truncated if the observations outside a specified range are totally lost and 
censored if we can at least observe the exogenous variables. A more precise 
definition will be given later. 

Censored and truncated regression models have been developed in other 
disciplines (notably, biometrics and engineering) more or less independently 
of their development in econometrics. Biometricians use the model to analyze 
the survival time of a patient. Censoring or truncation occurs when either a 
patient is still alive at the last observation date or he or she cannot be located. 
Similarly, engineers use the model to analyze the time to failure of material or 
of a machine or of a system. These models are called survival or duration 
models. Sociologists and economists have also used survival models to ana- 
lyze the duration of such phenomena as unemployment, welfare receipt, 
employment in a particular job, residence in a particular region, marriage, and 
the period of time between births. Mathematically, survival models belong to 
the same general class of models as Tobit models; survival models and Tobit 
models share certain characteristics. However, because survival models pos- 
sess special features, they will be discussed separately in Chapter 11. 

Between 1958, when Tobin's article appeared, and 1970, the Tobit
model was used infrequently in econometric applications, but since the early
1970s numerous applications ranging over a wide area of economics have
appeared and continue to appear. This phenomenon is due to a recent
increase in the availability of micro sample survey data, which the Tobit model
analyzes well, and to a recent advance in computer technology that has made
estimation of large-scale Tobit models feasible. At the same time, many
generalizations of the Tobit model and various estimation methods for these
models have been proposed. In fact, models and estimation methods are now 
so numerous and diverse that it is difficult for econometricians to keep track of 
all the existing models and estimation methods and maintain a clear notion of 
their relative merits. Thus it is now particularly useful to examine the current 
situation and prepare a unified summary and critical assessment of existing 
results. 

We shall try to accomplish this objective by means of classifying the diverse 
Tobit models into five basic types. (Our review of the empirical literature 
suggests that roughly 95% of the econometric applications of Tobit models fall 
into one of these five types.) Although there are many ways to classify Tobit 
models, we have chosen to classify them according to the form of the likeli- 
hood function. This way seems to be the statistically most useful classification 
because a similarity in the likelihood function implies a similarity in the 
appropriate estimation and computation methods. It is interesting to note 
that two models that superficially seem to be very different from each other 
can be shown to belong to the same type when they are classified according to 
this scheme. 

Sections 10.2 through 10.5 will deal with the standard Tobit model (or Type 
1 Tobit), and Sections 10.6 through 10.10 will deal with the remaining four 
types of models. Basic estimation methods, which with a slight modification 
can be applied to any of the five types, will be discussed at great length in 
Section 10.4. More specialized estimation methods will be discussed in rele- 
vant passages throughout the chapter. Each model is illustrated with a few 
empirical examples. 

We shall not discuss disequilibrium models except for a few basic models,
which will be examined in Section 10.10.4. Some general references on dis- 
equilibrium models will be cited there. Nor shall we discuss the related topic of 
switching regression models. For a discussion of these topics, the reader 
should consult articles by Maddala (1980, 1983). We shall not discuss Tobit 
models for panel data (individuals observed through time), except to mention 
a few articles in relevant passages. 


10.2 Standard Tobit Model (Type 1 Tobit Model) 


Figure 10.1 An example of censored data


Tobin (1958) noted that the observed relationship between household
expenditures on a durable good and household incomes looks like Figure 10.1,
where each dot represents an observation for a particular household. An
important characteristic of the data is that there are several observations
where the expenditure is 0. This feature destroys the linearity assumption so
that the least squares method is clearly inappropriate. Should we fit a nonlin- 
ear relationship? First, we must determine a statistical model that can generate 
the kind of data depicted in Figure 10.1. In doing so the first fact we should 
recognize is that we cannot use any continuous density to explain the condi- 
tional distribution of expenditure given income because a continuous density 
is inconsistent with the fact that there are several observations at 0. We shall 
develop an elementary utility maximization model to explain the phenome- 
non in question. 

Define the symbols needed for the utility maximization model as follows: 


$y$      a household's expenditure on a durable good
$y_0$    the price of the cheapest available durable good
$z$      all the other expenditures
$x$      income


A household is assumed to maximize utility $U(y, z)$ subject to the budget
constraint $y + z \le x$ and the boundary constraint $y \ge y_0$ or $y = 0$.
Suppose $y^*$ is the solution of the maximization subject to $y + z \le x$ but
not the other constraint, and assume $y^* = \beta_1 + \beta_2 x + u$, where $u$
may be interpreted as the collection of all the unobservable variables that
affect the utility function. Then the solution to the original problem,
denoted by $y$, can be defined by

$y = y^*$           if $y^* > y_0$    (10.2.1)
$\ \ = 0$ or $y_0$   if $y^* \le y_0$.

If we assume that $u$ is a random variable and that $y_0$ varies with households
but is assumed known, this model will generate data like Figure 10.1. We can
write the likelihood function for $n$ independent observations from the model
(10.2.1) as

$L = \prod_0 F_i(y_{0i}) \prod_1 f_i(y_i),$    (10.2.2)

where $F_i$ and $f_i$ are the distribution and density function, respectively,
of $y_i^*$, $\prod_0$ means the product over those $i$ for which
$y_i^* \le y_{0i}$, and $\prod_1$ means the product over those $i$ for which
$y_i^* > y_{0i}$. Note that the actual value of $y$ when $y^* \le y_0$ has no
effect on the likelihood function. Therefore the second line of Eq. (10.2.1)
may be changed to the statement "if $y^* \le y_0$, one merely observes that
fact."

The model originally proposed by Tobin (1958) is essentially the same as
the model given in the preceding paragraph except that he specifically
assumed $y^*$ to be normally distributed and assumed $y_0$ to be the same for all
households. We define the standard Tobit model (or Type 1 Tobit) as follows:


$y_i^* = x_i'\beta + u_i, \quad i = 1, 2, \ldots, n,$    (10.2.3)

$y_i = y_i^*$   if $y_i^* > 0$    (10.2.4)
$\ \ \ = 0$     if $y_i^* \le 0$,

where $\{u_i\}$ are assumed to be i.i.d. drawings from $N(0, \sigma^2)$. It is
assumed that $\{y_i\}$ and $\{x_i\}$ are observed for $i = 1, 2, \ldots, n$, but
$\{y_i^*\}$ are unobserved if $y_i^* \le 0$. Defining $\underline{X}$ to be the
$n \times K$ matrix the $i$th row of which is $x_i'$, we assume that $\{x_i\}$
are uniformly bounded and $\lim_{n\to\infty} n^{-1}\underline{X}'\underline{X}$
is positive definite. We also assume that the parameter space of $\beta$ and
$\sigma^2$ is compact. In the Tobit model we need to distinguish the vectors and
matrices of positive observations from the vectors and matrices of all the
observations; the latter appear with an underbar.

Note that $y_i^* > 0$ and $y_i^* \le 0$ in (10.2.4) may be changed to
$y_i^* > y_0$ and $y_i^* \le y_0$ without essentially changing the model,
whether $y_0$ is known or unknown, because $y_0$ can be absorbed into the
constant term of the regression. If, however, $y_{0i}$ changes with $i$ and is
known for every $i$, the model is slightly changed because the resulting model
would be essentially equivalent to the model defined by (10.2.3) and (10.2.4),
where one of the elements of $\beta$ other than the constant term is known. The
model where $y_{0i}$ changes with $i$ and is unknown is not generally estimable.

The likelihood function of the standard Tobit model is given by

$L = \prod_0 [1 - \Phi(x_i'\beta/\sigma)] \prod_1 \sigma^{-1}\phi[(y_i - x_i'\beta)/\sigma],$    (10.2.5)

where $\Phi$ and $\phi$ are the distribution and density function,
respectively, of the standard normal variable.

The Tobit model belongs to what is sometimes known as the censored
regression model. In contrast, when we observe neither $y_i$ nor $x_i$ when
$y_i^* \le 0$, the model is known as a truncated regression model. The
likelihood function of the truncated version of the Tobit model can be written
as

$L = \prod_1 \Phi(x_i'\beta/\sigma)^{-1} \sigma^{-1}\phi[(y_i - x_i'\beta)/\sigma].$    (10.2.6)

Henceforth, the standard Tobit model refers to the model defined by (10.2.3)
and (10.2.4), namely, a censored regression model, and the model the
likelihood function of which is given by (10.2.6) will be called the truncated
standard Tobit model.
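
The two likelihood functions can be written down directly in code. The following is a minimal sketch of the logarithms of (10.2.5) and (10.2.6), assuming `y` and `X` are NumPy arrays and that zeros in `y` mark the censored observations; the function names and the tiny data set are hypothetical and serve only to show the calling convention.

```python
import numpy as np
from scipy.stats import norm


def tobit_loglik(beta, sigma, y, X):
    """Log of (10.2.5): censored observations (y == 0) and positive ones."""
    xb = X @ beta
    pos = y > 0
    # log[1 - Phi(x'b/s)] = log Phi(-x'b/s) for the zero observations
    ll_zero = norm.logcdf(-xb[~pos] / sigma).sum()
    # log{ s^{-1} phi[(y - x'b)/s] } for the positive observations
    ll_pos = (norm.logpdf((y[pos] - xb[pos]) / sigma) - np.log(sigma)).sum()
    return ll_zero + ll_pos


def truncated_tobit_loglik(beta, sigma, y, X):
    """Log of (10.2.6): only positive y (and their x) are observed."""
    xb = X @ beta
    return (
        norm.logpdf((y - xb) / sigma) - np.log(sigma) - norm.logcdf(xb / sigma)
    ).sum()


# Hypothetical data: a constant and one regressor, two censored observations.
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, 0.0]])
y = np.array([1.2, 0.0, 3.1, 0.0])
beta, sigma = np.array([0.3, 0.8]), 1.5

full = tobit_loglik(beta, sigma, y, X)
trunc = truncated_tobit_loglik(beta, sigma, y[y > 0], X[y > 0])
print(full, trunc)
```

Working with `logcdf` and `logpdf` rather than taking logs of products avoids underflow when the index is far in the tail. Either function can be handed (with a sign change) to a general-purpose optimizer; that step is omitted here.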


10.3 Empirical Examples 


Tobin (1958) obtained the maximum likelihood estimates of his model ap- 
plied to data on 735 nonfarm households obtained from Surveys of Consumer 
Finances. The dependent variable of his estimated model was actually the 
ratio of total durable goods expenditure to disposable income and the inde- 
pendent variables were the age of the head of the household and the ratio of 
liquid assets to disposable income. 

Since then, and especially since the early 1970s, numerous applications of 
the standard Tobit model have appeared in economic journals, encompassing 
a wide range of fields in economics. A brief list of recent representative refer- 
ences, with a description of the dependent variable ( y) and the main indepen- 
dent variables (x), is presented in Table 10.1. In all the references, except that 
by Kotlikoff, who uses a two-step estimation method to be discussed later, the 
method of estimation is maximum likelihood. 


10.4 Properties of Estimators under Standard Assumptions 


In this section we shall discuss the properties of various estimators of the Tobit 
model under the assumptions of the model. The estimators we shall consider 
are the probit maximum likelihood (ML), least squares (LS), Heckman's 
two-step least squares, nonlinear least squares (NLLS), nonlinear weighted 
least squares (NLWLS), and Tobit ML estimators. 


Table 10.1 Applications of the standard Tobit model

Reference                 y                               x

Adams (1980)              Inheritance                     Income, marital status,
                                                          number of children

Ashenfelter and Ham       Ratio of unemployed hours       Years of schooling,
(1979)                    to employed hours               working experience

Fair (1978)               Number of extramarital          Sex, age, number of years
                          affairs                         married, number of children,
                                                          education, occupation,
                                                          degree of religiousness

Keeley et al. (1978)      Hours worked after a            Preprogram hours worked,
                          negative income tax program     change in the wage rate,
                                                          family characteristics

Kotlikoff (1979)          Expected age of retirement      Ratio of social security
                                                          benefits lost at time of
                                                          full-time employment to
                                                          full-time earnings

Reece (1979)              Charitable contributions        Price of contributions,
                                                          income

Rosenzweig (1980)         Annual days worked              Wages of husbands and wives,
                                                          education of husbands and
                                                          wives, income

Stephenson and            Family earnings after a         Earnings before the program,
McDonald (1979)           negative income tax program     husband's and wife's
                                                          education, other family
                                                          characteristics,
                                                          unemployment rate,
                                                          seasonal dummies

Wiggins (1981)            Annual marketing of new         Research expenditure of the
                          chemical entities               pharmaceutical industry,
                                                          stringency of government
                                                          regulatory standards

Witte (1980)              Number of arrests (or           Accumulated work release
                          convictions) per month          funds, number of months
                          after release from prison       after release until first
                                                          job, wage rate after release,
                                                          age, race, drug use


10.4.1 Probit Maximum Likelihood Estimator


The Tobit likelihood function (10.2.5) can be trivially rewritten as

$L = \prod_0 [1 - \Phi(x_i'\beta/\sigma)] \prod_1 \Phi(x_i'\beta/\sigma) \prod_1 \Phi(x_i'\beta/\sigma)^{-1}\sigma^{-1}\phi[(y_i - x_i'\beta)/\sigma].$    (10.4.1)

Then the first two products of the right-hand side of (10.4.1) constitute the
likelihood function of a probit model, and the last product is the likelihood
function of the truncated Tobit model as given in (10.2.6). The probit ML
estimator of $\alpha \equiv \beta/\sigma$, denoted $\hat\alpha$, is obtained by
maximizing the logarithm of the first two products.

Note that we can only estimate the ratio $\beta/\sigma$ by this method and not
$\beta$ or $\sigma$ separately. Because the estimator ignores a part of the
likelihood function that involves $\beta$ and $\sigma$, it is not fully
efficient. This loss of efficiency is not surprising when we realize that the
estimator uses only the sign of $y_i^*$, ignoring its numerical value even when
it is observed.

From the results of Section 9.2 we see that the probit MLE is consistent and
follows

$\hat\alpha - \alpha \cong (\underline{X}'D_1\underline{X})^{-1}\underline{X}'D_1D_0^{-1}(w - Ew),$    (10.4.2)

where $D_0$ is the $n \times n$ diagonal matrix the $i$th element of which is
$\phi(x_i'\alpha)$, $D_1$ is the $n \times n$ diagonal matrix the $i$th element
of which is $\Phi(x_i'\alpha)^{-1}[1 - \Phi(x_i'\alpha)]^{-1}\phi(x_i'\alpha)^2$,
and $w$ is the $n$-vector the $i$th element $w_i$ of which is defined by

$w_i = 1$   if $y_i^* > 0$    (10.4.3)
$\ \ \ = 0$   if $y_i^* \le 0$.

Note that the $i$th element of $Ew$ is equal to $\Phi(x_i'\alpha)$. The symbol
$\cong$ means that both sides have the same asymptotic distribution.¹ As shown
in Section 9.2.2, $\hat\alpha$ is asymptotically normal with mean $\alpha$ and
asymptotic variance-covariance matrix given by

$V\hat\alpha = (\underline{X}'D_1\underline{X})^{-1}.$    (10.4.4)


10.4.2 Least Squares Estimator 


From Figure 10.1 it is clear that the least squares regression of expenditure on
income using all the observations including zero expenditures yields biased
estimates. Although it is not so clear from the figure, the least squares
regression using only the positive expenditures also yields biased estimates.
These facts can be mathematically demonstrated as follows.

First, consider the regression using only positive observations of $y_i$. We
obtain from (10.2.3) and (10.2.4)

$E(y_i \mid y_i > 0) = x_i'\beta + E(u_i \mid u_i > -x_i'\beta).$    (10.4.5)

The last term of the right-hand side of (10.4.5) is generally nonzero (even
without assuming $u_i$ is normal). This implies the biasedness of the LS
estimator using positive observations on $y_i$ under more general models than
the standard Tobit model. When we assume normality of $u_i$ as in the Tobit
model, (10.4.5) can be shown by straightforward integration to be

$E(y_i \mid y_i > 0) = x_i'\beta + \sigma\lambda(x_i'\beta/\sigma),$    (10.4.6)

where $\lambda(z) = \phi(z)/\Phi(z)$.² As shown later, this equation plays a key
role in the derivation of Heckman's two-step, NLLS, and NLWLS estimators.
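
As a numerical check on (10.4.6), the sketch below computes $\lambda(z) = \phi(z)/\Phi(z)$ and the conditional mean, and compares the result with the mean of a normal distribution truncated from below at zero as reported by `scipy.stats.truncnorm`. The function names and the values of $x_i'\beta$ and $\sigma$ are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm, truncnorm


def inv_mills(z):
    """lambda(z) = phi(z) / Phi(z), computed on the log scale for stability."""
    return np.exp(norm.logpdf(z) - norm.logcdf(z))


def mean_given_positive(xb, sigma):
    """Right-hand side of (10.4.6): x'b + sigma * lambda(x'b / sigma)."""
    return xb + sigma * inv_mills(xb / sigma)


xb, sigma = 0.7, 2.0  # hypothetical values of x_i'beta and sigma

# y* ~ N(xb, sigma^2) truncated to (0, inf); truncnorm takes standardized bounds.
reference = truncnorm.mean((0.0 - xb) / sigma, np.inf, loc=xb, scale=sigma)
print(mean_given_positive(xb, sigma), reference)
```

The conditional mean exceeds $x_i'\beta$ by $\sigma\lambda(\cdot)$, which is positive for every value of the index; this is the term that least squares on the positive observations omits.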

Equation (10.4.6) clearly indicates that the LS estimator of $\beta$ is biased
and inconsistent, but the direction and magnitude of the bias or inconsistency
cannot be shown without further assumptions. Goldberger (1981) evaluated
the asymptotic bias (the probability limit minus the true value) assuming that
the elements of $x_i$ (except the first element, which is assumed to be a
constant) are normally distributed. More specifically, Goldberger rewrote
(10.2.3) as

$y_i^* = \beta_0 + x_i'\beta_1 + u_i$    (10.4.7)

and assumed $x_i \sim N(0, \Sigma)$, distributed independently of $u_i$. (Here,
the assumption of zero mean involves no loss of generality because a nonzero
mean can be absorbed into $\beta_0$.) Under this assumption he obtained

$\mathrm{plim}\ \hat\beta_1 = \frac{1 - \gamma}{1 - \rho^2\gamma}\,\beta_1,$    (10.4.8)

where $\gamma = \sigma_y^{-1}\lambda(\beta_0/\sigma_y)[\beta_0 + \sigma_y\lambda(\beta_0/\sigma_y)]$
and $\rho^2 = \sigma_y^{-2}\beta_1'\Sigma\beta_1$, where
$\sigma_y^2 = \sigma^2 + \beta_1'\Sigma\beta_1$. It can be shown that
$0 < \gamma < 1$ and $0 < \rho^2 < 1$; therefore (10.4.8) shows that
$\hat\beta_1$ shrinks $\beta_1$ toward 0. It is remarkable that the degree of
shrinkage is uniform in all the elements of $\beta_1$. However, the result may
not hold if $x_i$ is not normal; Goldberger gave a nonnormal example where
$\beta_1 = (1, 1)'$ and $\mathrm{plim}\ \hat\beta_1 = (1.111, 0.887)'$.

Consider the regression using all the observations of $y_i$, both positive and
0. To see that the least squares estimator is also biased in this case, we
should look at the unconditional mean of $y_i$,

$Ey_i = \Phi(x_i'\beta/\sigma)x_i'\beta + \sigma\phi(x_i'\beta/\sigma).$    (10.4.9)

Writing (10.2.3) again as (10.4.7) and using the same assumptions as
Goldberger, Greene (1981) showed

$\mathrm{plim}\ \tilde\beta_1 = \Phi(\beta_0/\sigma_y)\beta_1,$    (10.4.10)

where $\tilde\beta_1$ is the LS estimator of $\beta_1$ in the regression of
$y_i$ on $x_i$ using all the observations. This result is more useful than
(10.4.8) because it implies that $(n/n_1)\tilde\beta_1$ is a consistent
estimator of $\beta_1$, where $n_1$ is the number of positive observations of
$y_i$. A simple consistent estimator of $\beta_0$ can be similarly obtained.
Greene (1983) gave the asymptotic variances of these estimators.
Unfortunately, however, we cannot confidently use this estimator without
knowing its properties when the true distribution of $x_i$ is not normal.³
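
The probability limits in (10.4.8) and (10.4.10) can be illustrated by simulation. The sketch below is only a demonstration under the stated normality assumption, with a single normal regressor and arbitrarily chosen parameter values; the seed, sample size, and helper name `ols_slope` are assumptions made for the example.

```python
import numpy as np
from scipy.stats import norm


def ols_slope(y, x):
    """Slope from least squares of y on a constant and a scalar regressor x."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]


rng = np.random.default_rng(0)
n = 200_000
beta0, beta1, sigma, sd_x = 0.5, 1.0, 1.0, 1.0

x = rng.normal(0.0, sd_x, n)
y_star = beta0 + beta1 * x + rng.normal(0.0, sigma, n)
y = np.where(y_star > 0, y_star, 0.0)  # censoring rule (10.2.4)
pos = y > 0

# Quantities entering (10.4.8) and (10.4.10) for a scalar regressor.
sigma_y = np.sqrt(sigma**2 + beta1**2 * sd_x**2)
c = beta0 / sigma_y
lam = norm.pdf(c) / norm.cdf(c)
gamma = lam * (c + lam)
rho2 = beta1**2 * sd_x**2 / sigma_y**2

slope_pos = ols_slope(y[pos], x[pos])          # LS on positive observations
slope_all = ols_slope(y, x)                    # LS on all observations
pred_pos = (1 - gamma) / (1 - rho2 * gamma) * beta1   # (10.4.8)
pred_all = norm.cdf(c) * beta1                        # (10.4.10)
corrected = slope_all * n / pos.sum()          # (n / n_1) * slope, Greene's correction

print(slope_pos, pred_pos)
print(slope_all, pred_all)
print(corrected, beta1)
```

Both least squares slopes are pulled toward zero, and rescaling the all-observations slope by $n/n_1$ recovers a value close to $\beta_1$. The correction relies on the normality of $x_i$, as the text cautions.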


10.4.3 Heckman's Two-Step Estimator 


Heckman (1976a) proposed a two-step estimator in a two-equation
generalization of the Tobit model, which we shall call the Type 3 Tobit model.
But his estimator can also be used in the standard Tobit model, as well as in
more complex Tobit models, with only a minor adjustment. We shall discuss the
estimator in the context of the standard Tobit model because all the basic
features of the method can be revealed in this model. However, we should
keep in mind that since the method requires the computation of the probit
MLE, which itself requires an iterative method, the computational advantage
of the method over the Tobit MLE (which is more efficient) is not as great in
the standard Tobit model as it is in more complex Tobit models.

To explain this estimator, it is useful to rewrite (10.4.6) as

$y_i = x_i'\beta + \sigma\lambda(x_i'\alpha) + \epsilon_i,$   for $i$ such that $y_i > 0$,    (10.4.11)

where we have written $\alpha \equiv \beta/\sigma$ as before and
$\epsilon_i = y_i - E(y_i \mid y_i > 0)$ so that $E\epsilon_i = 0$. The variance
of $\epsilon_i$ is given by

$V\epsilon_i = \sigma^2 - \sigma^2 x_i'\alpha\lambda(x_i'\alpha) - \sigma^2\lambda(x_i'\alpha)^2.$    (10.4.12)

Thus (10.4.11) is a heteroscedastic nonlinear regression model with $n_1$
observations. The estimation method Heckman proposed consists of two steps:

Step 1. Estimate $\alpha$ by the probit MLE (denoted $\hat\alpha$) defined
earlier.

Step 2. Regress $y_i$ on $x_i$ and $\lambda(x_i'\hat\alpha)$ by least squares,
using only the positive observations on $y_i$.
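
The two steps can be sketched on simulated data as follows. This is an illustration under the standard Tobit assumptions, not a production implementation: the probit step is done with a generic quasi-Newton optimizer from SciPy rather than a dedicated probit routine, and the function names, seed, and parameter values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def probit_mle(w, X):
    """Step 1: maximize the probit log likelihood in alpha = beta / sigma."""
    def neg_loglik(a):
        xa = X @ a
        return -(w * norm.logcdf(xa) + (1.0 - w) * norm.logcdf(-xa)).sum()

    result = minimize(neg_loglik, np.zeros(X.shape[1]), method="BFGS")
    return result.x


def heckman_two_step(y, X):
    """Step 2: LS of positive y on x and lambda(x'alpha_hat)."""
    pos = y > 0
    alpha_hat = probit_mle(pos.astype(float), X)
    z = X[pos] @ alpha_hat
    lam_hat = np.exp(norm.logpdf(z) - norm.logcdf(z))  # inverse Mills ratio
    Z_hat = np.column_stack([X[pos], lam_hat])
    gamma_hat = np.linalg.lstsq(Z_hat, y[pos], rcond=None)[0]
    return gamma_hat[:-1], gamma_hat[-1], alpha_hat  # beta_hat, sigma_hat, alpha_hat


# Simulated standard Tobit data with hypothetical parameter values.
rng = np.random.default_rng(1)
n = 50_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma_true = np.array([0.5, 1.0]), 1.5
y_star = X @ beta_true + rng.normal(0.0, sigma_true, n)
y = np.where(y_star > 0, y_star, 0.0)

beta_hat, sigma_hat, alpha_hat = heckman_two_step(y, X)
print(beta_hat, sigma_hat, alpha_hat)
```

The coefficient on the added regressor estimates $\sigma$. Because $\lambda(x_i'\hat\alpha)$ is a smooth function of the same regressors, it tends to be fairly collinear with $x_i$, so the second-step estimates are noticeably less precise than the probit estimates of $\alpha$. The least squares standard errors from Step 2 are not valid as they stand, for the reasons developed in the derivation that follows.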

To facilitate further the discussion of Heckman's estimator, we can rewrite
(10.4.11) again as

$y_i = x_i'\beta + \sigma\lambda(x_i'\hat\alpha) + \epsilon_i + \eta_i,$   for $i$ such that $y_i > 0$,    (10.4.13)

where $\eta_i = \sigma[\lambda(x_i'\alpha) - \lambda(x_i'\hat\alpha)]$. We can
write (10.4.13) in vector notation as

$y = X\beta + \sigma\hat\lambda + \epsilon + \eta,$    (10.4.14)

where the vectors $y$, $\hat\lambda$, $\epsilon$, and $\eta$ have $n_1$ elements
and matrix $X$ has $n_1$ rows, corresponding to the positive observations of
$y_i$. We can further rewrite (10.4.14) as

$y = \hat Z\gamma + \epsilon + \eta,$    (10.4.15)

where we have defined $\hat Z = (X, \hat\lambda)$ and
$\gamma = (\beta', \sigma)'$. Then Heckman's two-step estimator of $\gamma$ is
defined as

$\hat\gamma = (\hat Z'\hat Z)^{-1}\hat Z'y.$    (10.4.16)

The consistency of $\hat\gamma$ follows easily from (10.4.15) and (10.4.16). We
shall derive its asymptotic distribution for the sake of completeness,
although the result is a special case of Heckman's result (Heckman, 1979).
From (10.4.15) and (10.4.16) we have

$\sqrt{n_1}(\hat\gamma - \gamma) = (n_1^{-1}\hat Z'\hat Z)^{-1}(n_1^{-1/2}\hat Z'\epsilon + n_1^{-1/2}\hat Z'\eta).$    (10.4.17)

Because the probit MLE $\hat\alpha$ is consistent, we have

$\mathrm{plim}\ n_1^{-1}\hat Z'\hat Z = \lim_{n_1\to\infty} n_1^{-1}Z'Z,$    (10.4.18)

where $Z = (X, \lambda)$. Under the assumptions stated after (10.2.4), it can be
shown that

$n_1^{-1/2}\hat Z'\epsilon \to N(0, \sigma^2 \lim n_1^{-1}Z'\Sigma Z),$    (10.4.19)

where $\sigma^2\Sigma \equiv E\epsilon\epsilon'$ is the $n_1 \times n_1$
diagonal matrix the diagonal elements of which are $V\epsilon_i$ given in
(10.4.12). We have by Taylor expansion of $\lambda(x_i'\hat\alpha)$ around
$\lambda(x_i'\alpha)$

$\eta \cong -\sigma\,\frac{\partial\lambda}{\partial\alpha'}(\hat\alpha - \alpha) + O(n^{-1}).$    (10.4.20)

Using (10.4.20) and (10.4.2), we can prove

$n_1^{-1/2}\hat Z'\eta \to N[0, \sigma^2 \lim n_1^{-1}Z'(I - \Sigma)X(\underline{X}'D_1\underline{X})^{-1}X'(I - \Sigma)Z],$    (10.4.21)

where $D_1$ was defined after (10.4.2). Next, note that $\epsilon$ and $\eta$
are uncorrelated because $\eta$ is asymptotically a linear function of $w$ on
account of (10.4.2) and (10.4.20) and $\epsilon$ and $w$ are uncorrelated.
Therefore, from (10.4.17), (10.4.18), (10.4.19), and (10.4.21), we finally
conclude that $\hat\gamma$ is asymptotically normal with mean $\gamma$ and
asymptotic variance-covariance matrix given by

$V\hat\gamma = \sigma^2(Z'Z)^{-1}Z'[\Sigma + (I - \Sigma)X(\underline{X}'D_1\underline{X})^{-1}X'(I - \Sigma)]Z(Z'Z)^{-1}.$    (10.4.22)
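
The appearance of $I - \Sigma$ in the second matrix inside the bracket can be traced to the derivative of $\lambda$. The following short derivation is offered as a sketch that is consistent with (10.4.12), (10.4.20), and (10.4.4); it is not taken from the text.

```latex
% Derivative of the inverse Mills ratio, using \phi'(z) = -z\phi(z):
\lambda'(z) = \frac{\phi'(z)\Phi(z) - \phi(z)^2}{\Phi(z)^2}
            = -\lambda(z)\,[z + \lambda(z)].

% By (10.4.12), the ith diagonal element of \Sigma is
% 1 - x_i'\alpha\,\lambda(x_i'\alpha) - \lambda(x_i'\alpha)^2, so that
\frac{\partial \lambda(x_i'\alpha)}{\partial \alpha'}
  = -\lambda(x_i'\alpha)\,[x_i'\alpha + \lambda(x_i'\alpha)]\,x_i'
  = -(1 - \Sigma_{ii})\,x_i',
\qquad
\frac{\partial \lambda}{\partial \alpha'} = -(I - \Sigma)X.

% Substituting into (10.4.20) and using V\hat\alpha from (10.4.4):
\eta \cong \sigma (I - \Sigma) X (\hat\alpha - \alpha),
\qquad
V\eta \cong \sigma^2 (I - \Sigma) X
      (\underline{X}'D_1\underline{X})^{-1} X' (I - \Sigma).
```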


Expression (10.4.22) may be consistently estimated either by replac- 
ing the unknown parameters by their consistent estimates or by 
(Z/Zy 'Z'AZ(Z/Zy !, where A is the diagonal matrix the ith diagonal element 
of which is [y; — x; f — GA(x/&)]’, following the idea of White (1980). 

Note that the second matrix within the square bracket in (10.4.22) arises 
because 4 had to be estimated. If A were known, we could apply least squares 
directly to (10.4.11) and the exact variance-covariance matrix would be 
L'I 'ZxZ2Z.-. 
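The two ingredients that drive (10.4.20) through (10.4.22) are the function $\lambda(z) = \phi(z)/\Phi(z)$ and the diagonal element $1 - z\lambda(z) - \lambda(z)^2$ of $\Sigma$ implied by (10.4.12). The following minimal Python sketch (the function names are illustrative and not from the text) evaluates both and checks numerically that $d\lambda/dz = -(1 - \Sigma_{ii})$, which is why the factor $I - \Sigma$ appears in (10.4.21).

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def mills(z):
    """lambda(z) = phi(z) / Phi(z), the regressor added in Heckman's second step."""
    return phi(z) / Phi(z)

def sigma_ii(z):
    """Diagonal element of Sigma: V(eps_i) / sigma^2 = 1 - z*lambda(z) - lambda(z)^2."""
    lam = mills(z)
    return 1.0 - z * lam - lam * lam

def mills_slope(z, step=1e-5):
    """Central finite-difference approximation to d lambda / d z."""
    return (mills(z + step) - mills(z - step)) / (2.0 * step)

for z in (-1.0, 0.0, 0.5, 2.0):
    # The last two columns should agree: d lambda / d z = -(1 - Sigma_ii).
    print(z, mills(z), sigma_ii(z), mills_slope(z), -(1.0 - sigma_ii(z)))
```

Because $0 < \Sigma_{ii} < 1$, the slope of $\lambda$ lies strictly between $-1$ and $0$, so an estimation error in $\hat\alpha$ is carried into $\hat\lambda$ with a damped, sign-reversed weight.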

Heckman's two-step estimator uses the conditional mean of $y_i$ given in
(10.4.6). A similar procedure can also be applied to the unconditional mean of
$y_i$ given by (10.4.9). That is to say, we can regress all the observations of $y_i$,
including zeros, on $\Phi x_i$ and $\phi$ after replacing the $\alpha$ that appears in the
argument of $\Phi$ and $\phi$ by the probit MLE $\hat\alpha$. In the same way as we derived (10.4.11)
and (10.4.13) from (10.4.6), we can derive the following two equations from
(10.4.9):
$$y_i = \Phi(x_i'\alpha)[x_i'\beta + \sigma\lambda(x_i'\alpha)] + \delta_i \tag{10.4.23}$$
and
$$y_i = \Phi(x_i'\hat\alpha)[x_i'\beta + \sigma\lambda(x_i'\hat\alpha)] + \delta_i + \xi_i, \tag{10.4.24}$$
where $\delta_i = y_i - Ey_i$ and $\xi_i = [\Phi(x_i'\alpha) - \Phi(x_i'\hat\alpha)]x_i'\beta + \sigma[\phi(x_i'\alpha) - \phi(x_i'\hat\alpha)]$. A
vector equation comparable to (10.4.15) is
$$\underline{y} = \hat D\hat{\underline{Z}}\gamma + \underline{\delta} + \underline{\xi}, \tag{10.4.25}$$
where $\hat D$ is the $n \times n$ diagonal matrix the $i$th element of which is $\Phi(x_i'\hat\alpha)$. Note
that the vectors and matrices appear with underbars because they consist of $n$
elements or rows. The two-step estimator of $\gamma$ based on all the observations,
denoted $\tilde\gamma$, is defined as
$$\tilde\gamma = (\hat{\underline{Z}}'\hat D^2\hat{\underline{Z}})^{-1}\hat{\underline{Z}}'\hat D\underline{y}. \tag{10.4.26}$$
The estimator can easily be shown to be consistent. To derive its asymptotic
distribution, we obtain from (10.4.25) and (10.4.26)


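$$\sqrt{n}(\tilde\gamma - \gamma) = (n^{-1}\hat{\underline{Z}}'\hat D^2\hat{\underline{Z}})^{-1}(n^{-1/2}\hat{\underline{Z}}'\hat D\underline{\delta} + n^{-1/2}\hat{\underline{Z}}'\hat D\underline{\xi}). \tag{10.4.27}$$

Here, unlike the previous case, an interesting fact emerges: By expanding
$\Phi(x_i'\hat\alpha)$ and $\phi(x_i'\hat\alpha)$ in Taylor series around $x_i'\alpha$ we can show $\xi_i = O(n^{-1})$;
the first-order terms cancel because $\phi'(z) = -z\phi(z)$ and $x_i'\beta = \sigma x_i'\alpha$.
Therefore
$$\operatorname{plim} n^{-1/2}\hat{\underline{Z}}'\hat D\underline{\xi} = 0. \tag{10.4.28}$$
Corresponding to (10.4.18), we have
$$\operatorname{plim} n^{-1}\hat{\underline{Z}}'\hat D^2\hat{\underline{Z}} = \lim n^{-1}\underline{Z}'D^2\underline{Z}, \tag{10.4.29}$$
where $D$ is obtained from $\hat D$ by replacing $\hat\alpha$ with $\alpha$. Corresponding to (10.4.19),
we have
$$n^{-1/2}\hat{\underline{Z}}'\hat D\underline{\delta} \to N(0, \sigma^2 \lim n^{-1}\underline{Z}'D^2\Omega\underline{Z}), \tag{10.4.30}$$
where $\sigma^2\Omega \equiv E\underline{\delta}\,\underline{\delta}'$ is the $n \times n$ diagonal matrix the $i$th element of which is
$\sigma^2\Phi(x_i'\alpha)\{(x_i'\alpha)^2 + x_i'\alpha\lambda(x_i'\alpha) + 1 - \Phi(x_i'\alpha)[x_i'\alpha + \lambda(x_i'\alpha)]^2\}$. Therefore, from
(10.4.27) through (10.4.30), we conclude that $\tilde\gamma$ is asymptotically normal with
mean $\gamma$ and asymptotic variance-covariance matrix given by
$$V\tilde\gamma = \sigma^2(\underline{Z}'D^2\underline{Z})^{-1}\underline{Z}'D^2\Omega\underline{Z}(\underline{Z}'D^2\underline{Z})^{-1}. \tag{10.4.31}$$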


Which of the two estimators $\hat\gamma$ and $\tilde\gamma$ is preferred? Unfortunately, the
difference of the two matrices given by (10.4.22) and (10.4.31) is generally neither
positive definite nor negative definite. Thus an answer to the preceding
question depends on parameter values.
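The all-observations estimator rests on the unconditional mean in (10.4.23) and on the variance of $\delta_i$ that forms the diagonal of $\sigma^2\Omega$. As a rough numerical check, the sketch below compares both closed forms with direct quadrature for a single observation; the values of $x_i'\beta$ and $\sigma$ (called `mu` and `sigma`) are hypothetical, and the helper names are not from the text.

```python
import math

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def simpson(f, a, b, n=20000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * f(a + k * h)
    return total * h / 3.0

def censored_moments_numeric(mu, sigma):
    """E(y) and V(y) for y = max(0, mu + sigma*u), u ~ N(0, 1), by quadrature."""
    lo = -mu / sigma                 # y > 0 exactly when u > -mu/sigma
    hi = max(lo, 0.0) + 10.0         # the normal tail beyond hi is negligible
    m1 = simpson(lambda u: (mu + sigma * u) * phi(u), lo, hi)
    m2 = simpson(lambda u: (mu + sigma * u) ** 2 * phi(u), lo, hi)
    return m1, m2 - m1 * m1

def censored_moments_formula(mu, sigma):
    """Mean from (10.4.23) and the corresponding diagonal element of sigma^2 * Omega."""
    z = mu / sigma
    lam = phi(z) / Phi(z)
    mean = Phi(z) * (mu + sigma * lam)
    var = sigma ** 2 * Phi(z) * (z * z + z * lam + 1.0 - Phi(z) * (z + lam) ** 2)
    return mean, var

print(censored_moments_numeric(0.5, 2.0))
print(censored_moments_formula(0.5, 2.0))
```

The two printed pairs should agree to many digits, which is a convenient way to catch sign or bracket errors when coding $\Omega$ for the weighted version of the estimator.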

Both (10.4.15) and (10.4.25) represent heteroscedastic regression models.
Therefore we can obtain asymptotically more efficient estimators by using
weighted least squares (WLS) in the second step of the procedure for obtaining
$\hat\gamma$ and $\tilde\gamma$. In doing so, we must use a consistent estimate of the asymptotic
variance-covariance matrix of $\epsilon + \eta$ for the case of (10.4.15) and of $\underline{\delta} + \underline{\xi}$ for
the case of (10.4.25). Because these matrices depend on $\gamma$, an initial consistent
estimate of $\gamma$ (say, $\hat\gamma$ or $\tilde\gamma$) is needed to obtain the WLS estimators. We call these
WLS estimators $\hat\gamma_W$ and $\tilde\gamma_W$, respectively. It can be shown that they are
consistent and asymptotically normal with asymptotic variance-covariance matrices
given by
$$V\hat\gamma_W = \sigma^2\{Z'[\Sigma + (I - \Sigma)X(\underline{X}'D_1\underline{X})^{-1}X'(I - \Sigma)]^{-1}Z\}^{-1} \tag{10.4.32}$$
and
$$V\tilde\gamma_W = \sigma^2(\underline{Z}'D^2\Omega^{-1}\underline{Z})^{-1}. \tag{10.4.33}$$


Again, we cannot make a definite comparison between the two matrices. 


10.4.4 Nonlinear Least Squares and Nonlinear Weighted Least 
Squares Estimators 


In this subsection we shall consider four estimators: the NLLS and NLWLS
estimators applied to (10.4.11), denoted $\hat\gamma_N$ and $\hat\gamma_{NW}$, respectively, and the
NLLS and NLWLS estimators applied to (10.4.23), denoted $\tilde\gamma_N$ and $\tilde\gamma_{NW}$,
respectively.

All these estimators are consistent and their asymptotic distributions can be
obtained straightforwardly by noting that all the results of a linear regression
model hold asymptotically for a nonlinear regression model if we treat the
derivative of the nonlinear regression function with respect to the parameter
vector as the regression matrix. In this way we can verify the interesting fact
that $\tilde\gamma_N$ and $\tilde\gamma_{NW}$ have the same asymptotic distributions as $\tilde\gamma$ and $\tilde\gamma_W$,
respectively. We can also show that $\hat\gamma_N$ and $\hat\gamma_{NW}$ are asymptotically normal with
mean $\gamma$ and with their respective asymptotic variance-covariance matrices
given by
$$V\hat\gamma_N = \sigma^2(S'S)^{-1}S'\Sigma S(S'S)^{-1} \tag{10.4.34}$$
and
$$V\hat\gamma_{NW} = \sigma^2(S'\Sigma^{-1}S)^{-1}, \tag{10.4.35}$$
where $S = (\Sigma X, D_2\lambda)$, where $D_2$ is the $n_1 \times n_1$ diagonal matrix the $i$th element
of which is $1 + (x_i'\alpha)^2 + x_i'\alpha\lambda(x_i'\alpha)$. We cannot make a definite comparison
either between (10.4.22) and (10.4.34) or between (10.4.32) and (10.4.35).

In the two-step methods defining $\hat\gamma$ and $\tilde\gamma$ and their generalizations $\hat\gamma_W$ and
$\tilde\gamma_W$, we can naturally define an iteration procedure by repeating the two steps.
For example, having obtained $\hat\gamma$, we can obtain a new estimate of $\alpha$, insert it
into the argument of $\lambda$, and apply least squares again to Eq. (10.4.11). The
procedure is to be repeated until a sequence of estimates of $\alpha$ thus obtained
converges. In the iteration starting from $\hat\gamma_W$, we use the $m$th-round estimate of
$\gamma$ not only to evaluate $\lambda$ but also to estimate the variance-covariance matrix of
the error term for the purpose of obtaining the $(m + 1)$st-round estimate.
Iterations starting from $\tilde\gamma$ and $\tilde\gamma_W$ can be similarly defined but are probably not
worthwhile because $\tilde\gamma$ and $\tilde\gamma_W$ are asymptotically equivalent to $\tilde\gamma_N$ and $\tilde\gamma_{NW}$, as
we indicated earlier. The estimators ($\hat\gamma_N$, $\hat\gamma_{NW}$, $\tilde\gamma_N$, $\tilde\gamma_{NW}$) are clearly stationary
values of the iterations starting from ($\hat\gamma$, $\hat\gamma_W$, $\tilde\gamma$, $\tilde\gamma_W$). However, they may not
necessarily be the converging values.

A simulation study by Wales and Woodland (1980) based on only one
replication with sample sizes of 1000 and 5000 showed that $\hat\gamma_N$ is distinctly
inferior to the MLE and is rather unsatisfactory.


10.4.5 Tobit Maximum Likelihood Estimator 


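The Tobit MLE maximizes the likelihood function (10.2.5). Under the
assumptions given after (10.2.4), Amemiya (1973c) proved its consistency and
asymptotic normality. If we define $\theta = (\beta', \sigma^2)'$, the asymptotic
variance-covariance matrix of the Tobit MLE $\hat\theta$ is given by
$$V\hat\theta = \begin{bmatrix} \sum_{i=1}^n a_i x_i x_i' & \sum_{i=1}^n b_i x_i \\ \sum_{i=1}^n b_i x_i' & \sum_{i=1}^n c_i \end{bmatrix}^{-1}, \tag{10.4.36}$$
where
$$a_i = -\sigma^{-2}\{x_i'\alpha\phi_i - [\phi_i^2/(1 - \Phi_i)] - \Phi_i\},$$
$$b_i = (1/2)\sigma^{-3}\{(x_i'\alpha)^2\phi_i + \phi_i - [(x_i'\alpha)\phi_i^2/(1 - \Phi_i)]\}, \text{ and}$$
$$c_i = -(1/4)\sigma^{-4}\{(x_i'\alpha)^3\phi_i + (x_i'\alpha)\phi_i - [(x_i'\alpha)^2\phi_i^2/(1 - \Phi_i)] - 2\Phi_i\};$$
and $\phi_i$ and $\Phi_i$ stand for $\phi(x_i'\alpha)$ and $\Phi(x_i'\alpha)$, respectively.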

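The Tobit MLE must be computed iteratively. Olsen (1978) proved the
global concavity of log L in the Tobit model in terms of the transformed
parameters $\alpha = \beta/\sigma$ and $h = \sigma^{-1}$, a result that implies that a standard iterative
method such as Newton-Raphson or the method of scoring always converges
to the global maximum of log L. The log L in terms of the new parameters
can be written as
$$\log L = \sum_0 \log[1 - \Phi(x_i'\alpha)] + n_1\log h - \frac{1}{2}\sum_1 (hy_i - x_i'\alpha)^2, \tag{10.4.37}$$
from which Olsen obtained
$$\begin{bmatrix} \dfrac{\partial^2\log L}{\partial\alpha\,\partial\alpha'} & \dfrac{\partial^2\log L}{\partial\alpha\,\partial h} \\ \dfrac{\partial^2\log L}{\partial h\,\partial\alpha'} & \dfrac{\partial^2\log L}{\partial h^2} \end{bmatrix} = \begin{bmatrix} \sum_0 \dfrac{\phi_i}{1 - \Phi_i}\left(x_i'\alpha - \dfrac{\phi_i}{1 - \Phi_i}\right)x_ix_i' & 0 \\ 0 & -\dfrac{n_1}{h^2} \end{bmatrix} + \begin{bmatrix} -\sum_1 x_ix_i' & \sum_1 x_iy_i \\ \sum_1 y_ix_i' & -\sum_1 y_i^2 \end{bmatrix}. \tag{10.4.38}$$

Because $x_i'\alpha - [1 - \Phi(x_i'\alpha)]^{-1}\phi(x_i'\alpha) < 0$, the right-hand side of (10.4.38) is
the sum of two negative-definite matrices and hence is negative definite.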
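To make Olsen's reparameterization concrete, here is a minimal sketch of Newton-Raphson on (10.4.37) for the constant-only case ($x_i = 1$, so $\alpha$ is a scalar called `a`). The data are hypothetical, and the step-halving safeguard is an addition for this illustration rather than part of Olsen's argument; the gradient and Hessian are (10.4.38) specialized to $x_i = 1$.

```python
import math

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def upper_tail(z):
    """1 - Phi(z), computed without cancellation."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def loglik(a, h, y, n0):
    """log L of (10.4.37) with x_i = 1; y holds the positive observations, n0 counts zeros."""
    return (n0 * math.log(upper_tail(a)) + len(y) * math.log(h)
            - 0.5 * sum((h * yi - a) ** 2 for yi in y))

def gradient(a, h, y, n0):
    r = phi(a) / upper_tail(a)
    ga = -n0 * r + sum(h * yi - a for yi in y)
    gh = len(y) / h - sum((h * yi - a) * yi for yi in y)
    return ga, gh

def hessian(a, h, y, n0):
    r = phi(a) / upper_tail(a)
    haa = n0 * r * (a - r) - len(y)
    hah = sum(y)
    hhh = -len(y) / (h * h) - sum(yi * yi for yi in y)
    return haa, hah, hhh

def newton(y, n0, a=0.0, h=1.0, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        ga, gh = gradient(a, h, y, n0)
        if max(abs(ga), abs(gh)) < tol:
            break
        haa, hah, hhh = hessian(a, h, y, n0)
        det = haa * hhh - hah * hah
        da = -(hhh * ga - hah * gh) / det      # Newton direction: -H^{-1} g
        dh = -(haa * gh - hah * ga) / det
        t, base = 1.0, loglik(a, h, y, n0)
        # Step halving keeps h positive and stops log L from falling (up to rounding).
        while t > 1e-12 and (h + t * dh <= 0.0
                             or loglik(a + t * da, h + t * dh, y, n0) < base - 1e-12):
            t *= 0.5
        a, h = a + t * da, h + t * dh
    return a, h

y_pos = [0.5, 1.2, 0.3, 2.0, 1.1]   # hypothetical positive observations
n0 = 3                              # hypothetical number of censored observations
a_hat, h_hat = newton(y_pos, n0)
print("mu =", a_hat / h_hat, "sigma =", 1.0 / h_hat)
```

The original parameters are recovered as $\mu = \alpha/h$ and $\sigma = 1/h$. Starting the iteration from a different point should lead to the same solution, which is what global concavity suggests.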

Even though convergence is assured by global concavity, it is a good idea to 
start an iteration with a good estimator because it will improve the speed of 
convergence. Tobin (1958) used a simple estimator based on a linear approxi- 
mation of the reciprocal of Mills’ ratio to start his iteration for obtaining the 
MLE. Although Amemiya (1973c) showed that Tobin’s initial estimator is 
inconsistent, empirical researchers have found it to be a good starting value for 
iteration. 

Amemiya (1973c) proposed the following simple consistent estimator. We
have
$$E(y_i^2 \mid y_i > 0) = (x_i'\beta)^2 + \sigma x_i'\beta\lambda(x_i'\alpha) + \sigma^2. \tag{10.4.39}$$
Combining (10.4.6) and (10.4.39) yields
$$E(y_i^2 \mid y_i > 0) = x_i'\beta E(y_i \mid y_i > 0) + \sigma^2, \tag{10.4.40}$$
which can be alternatively written as
$$y_i^2 = y_ix_i'\beta + \sigma^2 + \zeta_i, \quad \text{for } i \text{ such that } y_i > 0, \tag{10.4.41}$$
where $E(\zeta_i \mid y_i > 0) = 0$. Then consistent estimates of $\beta$ and $\sigma^2$ are obtained by
applying an instrumental variables method to (10.4.41) using $(\hat y_ix_i', 1)$ as the
instrumental variables, where $\hat y_i$ is the predictor of $y_i$ obtained by regressing
positive $y_i$ on $x_i$ and, perhaps, powers of $x_i$. The asymptotic distribution of the
estimator has been given by Amemiya (1973c). A simulation study by Wales
and Woodland (1980) indicated that this estimator is rather inefficient.
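The instrumental variables step applied to (10.4.41) can be sketched in a few lines for a deliberately simple design: one regressor, no intercept, and simulated data. Everything specific here (the design, the parameter values $\beta = 1$ and $\sigma = 1$, the seed, and the function names) is an assumption made for illustration; the first stage regresses positive $y_i$ on a constant and $x_i$ only.

```python
import random

def ols_line(points):
    """Least squares intercept and slope for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    slope = sxy / sxx
    return my - slope * mx, slope

def amemiya_iv(points):
    """IV estimate of (beta, sigma^2) in y_i^2 = y_i*x_i*beta + sigma^2 + zeta_i.

    points holds (x_i, y_i) for the positive observations only.
    """
    a, b = ols_line(points)                      # first stage: predictor of y_i
    z = [(a + b * x) * x for x, _ in points]     # instrument yhat_i * x_i
    r = [y * x for x, y in points]               # regressor y_i * x_i
    q = [y * y for _, y in points]               # dependent variable y_i^2
    n = len(points)
    s11 = sum(zi * ri for zi, ri in zip(z, r))
    s12, s21 = sum(z), sum(r)
    t1 = sum(zi * qi for zi, qi in zip(z, q))
    t2 = sum(q)
    det = s11 * n - s12 * s21                    # 2x2 system solved by Cramer's rule
    beta = (t1 * n - s12 * t2) / det
    sigma2 = (s11 * t2 - s21 * t1) / det
    return beta, sigma2, z, r, q

rng = random.Random(1)
sample = []
for _ in range(20000):
    x = rng.uniform(0.5, 2.5)
    y_star = 1.0 * x + rng.gauss(0.0, 1.0)       # hypothetical beta = 1, sigma = 1
    if y_star > 0.0:
        sample.append((x, y_star))

beta_iv, sigma2_iv, z, r, q = amemiya_iv(sample)
print(beta_iv, sigma2_iv)
```

By construction the IV residuals are orthogonal to both instruments in the sample. How close the printed estimates come to the values used in the simulation depends on the design, in line with the remark above that the estimator is rather inefficient.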


10.4.6 The EM Algorithm 


The EM algorithm is a general iterative method for obtaining the MLE; it was 
first proposed by Hartley (1958) and was generalized by Dempster, Laird, and 
Rubin (1977) to a form that is especially suited for censored regression models 
such as Tobit models. We shall first present the definition and the properties of 
the EM algorithm under a general setting and then apply it to the standard 
Tobit model. It can also be applied to general Tobit models. 

We shall explain the EM algorithm in a general model where a vector of
observable variables $z$ is related to a vector of unobservable variables $y^*$ in
such a way that the value of $y^*$ uniquely determines the value of $z$ but not vice
versa. In the Tobit model, $\{y_i^*\}$ defined in (10.2.3) constitute the elements of
$y^*$, and $\{y_i\}$ and $\{w_i\}$ defined in (10.2.4) and (10.4.3), respectively, constitute
the elements of $z$. Let the joint density or probability of $y^*$ be $f(y^*)$ and let the
joint density or probability of $z$ be $g(z)$. Also, define $k(y^*|z) = f(y^*)/g(z)$. Note
that $f(y^*, z) = f(z|y^*)f(y^*) = f(y^*)$ because $f(z|y^*) = 1$ inasmuch as $y^*$
uniquely determines $z$. We implicitly assume that $f$, $g$, and $k$ depend on a
vector of parameters $\theta$. The purpose is to maximize
$$L(\theta) \equiv n^{-1}\log g(z) = n^{-1}\log f(y^*) - n^{-1}\log k(y^*|z) \tag{10.4.42}$$
with respect to $\theta$. Define
$$Q(\theta|\theta_1) = E[n^{-1}\log f(y^*|\theta) \mid z, \theta_1], \tag{10.4.43}$$
where we are taking the expectation assuming $\theta_1$ is the true parameter value
and doing this conditional on $z$. Then the EM algorithm purports to maximize
$L(\theta)$ by maximizing $Q(\theta|\theta_1)$ with respect to $\theta$ where $\theta_1$ is given at each step of
the iteration. The "E" of the name "EM" refers to the expectation taken in
(10.4.43) and the "M" refers to the maximization of (10.4.43).

Consider the convergence properties of the EM algorithm. Define 


$$H(\theta|\theta_1) = E[n^{-1}\log k(y^*|z, \theta) \mid z, \theta_1]. \tag{10.4.44}$$

Then we have from (10.4.42), (10.4.43), and (10.4.44) and the fact that
$L(\theta|\theta_1) \equiv E[n^{-1}\log g(z) \mid z, \theta_1] = L(\theta)$
$$L(\theta) = Q(\theta|\theta_1) - H(\theta|\theta_1). \tag{10.4.45}$$
But we have by Jensen's inequality (4.2.6)
$$H(\theta|\theta_1) < H(\theta_1|\theta_1) \quad \text{for } \theta \ne \theta_1. \tag{10.4.46}$$
Now, given $\theta_1$, let $M(\theta_1)$ maximize $Q(\theta|\theta_1)$ with respect to $\theta$. Then we have
$$L(M) = Q(M|\theta_1) - H(M|\theta_1). \tag{10.4.47}$$

But, because $Q(M|\theta_1) \ge Q(\theta_1|\theta_1)$ by definition and $H(M|\theta_1) \le H(\theta_1|\theta_1)$ by
(10.4.46), we have from (10.4.45) and (10.4.47)
$$L(M) \ge L(\theta_1). \tag{10.4.48}$$


Thus we have proved the desirable property that L always increases or stays 
constant at each step of the EM algorithm. 

The preceding result implies that if $L$ is bounded, then $\lim_{r\to\infty} L(\theta_r)$ exists.
Let $\theta^*$ satisfy the equality $\lim_{r\to\infty} L(\theta_r) = L(\theta^*)$. ($\theta^*$ exists if the range of $L(\theta)$
is closed.) We shall show that if $\theta^*$ is a stationary point of the EM algorithm,
then $\theta^*$ is a stationary point of $L$. For this we assume $L$ is twice differentiable.
Differentiating (10.4.45) with respect to $\theta$ and evaluating the derivative at
$\theta = \theta_1$, we obtain
$$\left.\frac{\partial L}{\partial\theta}\right|_{\theta_1} = \left.\frac{\partial Q(\theta|\theta_1)}{\partial\theta}\right|_{\theta_1} - \left.\frac{\partial H(\theta|\theta_1)}{\partial\theta}\right|_{\theta_1}. \tag{10.4.49}$$

But the last term of the right-hand side of (10.4.49) is 0 because of (10.4.46).
Therefore, if $\theta_1$ is a stationary point of $Q(\theta|\theta_1)$, it is a stationary point of $L$.

Unfortunately, a local maximum of $Q(\theta|\theta_1)$ with respect to $\theta$ may not be a
local maximum (let alone the global maximum) of $L(\theta)$ because the negative
definiteness of $[\partial^2Q(\theta|\theta_1)/\partial\theta\,\partial\theta']_{\theta_1}$ does not imply the negative definiteness of
$[\partial^2L/\partial\theta\,\partial\theta']_{\theta_1}$. However, this is true of any iterative method commonly used.
See Wu (1983) for further discussion of the convergence properties of the EM
algorithm.

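Now consider an application of the algorithm to the Tobit model. Define
$\theta = (\beta', \sigma^2)'$ as before. Then in the Tobit model we have
$$\log f(y^*|\theta) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i^* - x_i'\beta)^2, \tag{10.4.50}$$
and, for a given estimate $\theta_1 = (\beta_1', \sigma_1^2)'$, the EM algorithm maximizes with
respect to $\beta$ and $\sigma^2$
$$\begin{aligned} E[\log f(y^*|\theta) \mid y, w, \theta_1] &= -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_1 (y_i - x_i'\beta)^2 - \frac{1}{2\sigma^2}\sum_0 E[(y_i^* - x_i'\beta)^2 \mid w_i = 0, \theta_1] \\ &= -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_1 (y_i - x_i'\beta)^2 - \frac{1}{2\sigma^2}\sum_0 [E(y_i^* \mid w_i = 0, \theta_1) - x_i'\beta]^2 \\ &\quad - \frac{1}{2\sigma^2}\sum_0 V(y_i^* \mid w_i = 0, \theta_1), \end{aligned} \tag{10.4.51}$$
where
$$E(y_i^* \mid w_i = 0, \theta_1) = x_i'\beta_1 - \frac{\sigma_1\phi_{i1}}{1 - \Phi_{i1}} \equiv y_i^0 \tag{10.4.52}$$
and
$$V(y_i^* \mid w_i = 0, \theta_1) = \sigma_1^2 + x_i'\beta_1\frac{\sigma_1\phi_{i1}}{1 - \Phi_{i1}} - \left[\frac{\sigma_1\phi_{i1}}{1 - \Phi_{i1}}\right]^2, \tag{10.4.53}$$
where $\phi_{i1} = \phi(x_i'\beta_1/\sigma_1)$ and $\Phi_{i1} = \Phi(x_i'\beta_1/\sigma_1)$.

From (10.4.51) it is clear that the second-round estimate of $\beta$ in the EM
algorithm, denoted $\beta_2$, is obtained as follows: Assume without loss of
generality that the first $n_1$ observations of $y_i$ are positive and call the vector of those
observations $y$. Next, define an $(n - n_1)$-vector $y^0$ the elements of which are
the $y_i^0$ defined in (10.4.52). Then we have
$$\beta_2 = (\underline{X}'\underline{X})^{-1}\underline{X}'\begin{bmatrix} y \\ y^0 \end{bmatrix}, \tag{10.4.54}$$
where $\underline{X}$ was defined after (10.2.4). In other words, the EM algorithm amounts
to predicting all the unobservable values of $y_i^*$ by their conditional
expectations and treating the predicted values as if they were the observed values. The
second-round estimate of $\sigma^2$, denoted $\sigma_2^2$, is given by
$$\sigma_2^2 = n^{-1}\left[\sum_1 (y_i - x_i'\beta_2)^2 + \sum_0 (y_i^0 - x_i'\beta_2)^2 + \sum_0 V(y_i^* \mid w_i = 0, \theta_1)\right]. \tag{10.4.55}$$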

We can directly show that the MLE $\hat\theta$ is the equilibrium solution of the
iteration defined by (10.4.54) and (10.4.55). Partition $\underline{X} = (X', X^{0\prime})'$ so that
$X$ is multiplied by $y$ and $X^0$ by $y^0$. Then inserting $\hat\theta$ into both sides of (10.4.54)
yields, after collecting terms,
$$X'X\hat\beta = X'y - X^{0\prime}\left[\frac{\hat\sigma\phi(x_i'\hat\beta/\hat\sigma)}{1 - \Phi(x_i'\hat\beta/\hat\sigma)}\right], \tag{10.4.56}$$
where the last bracket denotes an $(n - n_1)$-dimensional vector the typical
element of which is given inside. Now, setting the derivative of log L with
respect to $\beta$ equal to 0 yields
$$\frac{\partial\log L}{\partial\beta} = -\frac{1}{\sigma}\sum_0 \frac{\phi_i}{1 - \Phi_i}x_i + \frac{1}{\sigma^2}\sum_1 (y_i - x_i'\beta)x_i = 0. \tag{10.4.57}$$
But, clearly, (10.4.56) is equivalent to (10.4.57). Similarly, the normal
equation for $\sigma^2$ can be shown to be equivalent to (10.4.55).
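The iteration (10.4.52) through (10.4.55) is easy to trace in the constant-only case ($x_i = 1$, $x_i'\beta = \mu$). The sketch below uses hypothetical data and illustrative function names; it records log L after every round so that the monotonicity property (10.4.48) can be inspected, and it evaluates the score (10.4.57) at the final iterate.

```python
import math

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def upper_tail(z):
    """1 - Phi(z), computed without cancellation."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def loglik(mu, sigma, y_pos, n0):
    """Tobit log likelihood for a constant-only model (n0 = number of zeros)."""
    ll = n0 * math.log(upper_tail(mu / sigma))
    for y in y_pos:
        ll += (-math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
               - 0.5 * ((y - mu) / sigma) ** 2)
    return ll

def em_step(mu, sigma, y_pos, n0):
    n = len(y_pos) + n0
    hazard = phi(mu / sigma) / upper_tail(mu / sigma)
    y0 = mu - sigma * hazard                                       # (10.4.52)
    v0 = sigma ** 2 + mu * sigma * hazard - (sigma * hazard) ** 2  # (10.4.53)
    mu_new = (sum(y_pos) + n0 * y0) / n                            # (10.4.54), x_i = 1
    ss = sum((y - mu_new) ** 2 for y in y_pos) + n0 * ((y0 - mu_new) ** 2 + v0)
    return mu_new, math.sqrt(ss / n)                               # (10.4.55)

def score_mu(mu, sigma, y_pos, n0):
    """Left-hand side of (10.4.57) for the constant-only model."""
    hazard = phi(mu / sigma) / upper_tail(mu / sigma)
    return -n0 * hazard / sigma + sum(y - mu for y in y_pos) / sigma ** 2

y_pos = [0.5, 1.2, 0.3, 2.0, 1.1]   # hypothetical positive observations
n0 = 3                              # hypothetical number of censored observations
mu, sigma = 0.0, 1.0                # arbitrary starting values
path = [loglik(mu, sigma, y_pos, n0)]
for _ in range(500):
    mu, sigma = em_step(mu, sigma, y_pos, n0)
    path.append(loglik(mu, sigma, y_pos, n0))

print(mu, sigma, path[-1], score_mu(mu, sigma, y_pos, n0))
```

In this toy example the recorded values of log L should never decrease, and the score at the final iterate should be essentially zero, consistent with the equivalence of (10.4.56) and (10.4.57).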

Schmee and Hahn (1979) performed a simulation study of the EM
algorithm applied to a censored regression model (a survival model) defined by
$$y_i = \begin{cases} y_i^* & \text{if } y_i^* \le c \\ c & \text{if } y_i^* > c, \end{cases}$$
where $y_i^* \sim N(\alpha + \beta x_i, \sigma^2)$. They generally obtained rapid convergence.


10.5 Properties of the Tobit Maximum Likelihood Estimator under 
Nonstandard Assumptions 


In this section we shall discuss the properties of the Tobit MLE, the estimator
that maximizes (10.2.5), under various types of nonstandard assumptions:
heteroscedasticity, serial correlation, and nonnormality. It will be
shown that the Tobit MLE remains consistent under serial correlation but not
under heteroscedasticity or nonnormality. The same is true of the other
estimators considered earlier. This result contrasts with the classical regression
model in which the least squares estimator (the MLE under the normality
assumption) is generally consistent under all of the three types of nonstandard
assumptions mentioned earlier.

Before proceeding with a rigorous argument, we shall give an intuitive
explanation of the aforementioned result. By considering (10.4.11) we see that
serial correlation of $y_i$ should not affect the consistency of the NLLS estimator,
whereas heteroscedasticity changes $\sigma$ to $\sigma_i$ and hence invalidates the
estimation of the equation by least squares. If $y_i^*$ is not normal, Eq. (10.4.11) itself is
generally invalid, which leads to the inconsistency of the NLLS estimator.
Although the NLLS estimator is different from the ML estimator, we can
expect a certain correspondence between the consistency properties of the two
estimators.


10.5.1 Heteroscedasticity 


Hurd (1979) evaluated the probability limit of the truncated Tobit MLE when
a certain type of heteroscedasticity is present in two simple truncated Tobit
models: (1) the i.i.d. case (that is, the case of the regressor consisting only of a
constant term) and (2) the case of a constant term plus one independent
variable. Recall that the truncated Tobit model is the one in which no
information is available for those observations for which $y_i^* \le 0$ and therefore the
MLE maximizes (10.2.6) rather than (10.2.5).

In the i.i.d. case Hurd created heteroscedasticity by generating $rn$
observations from $N(\mu, \sigma_1^2)$ and $(1 - r)n$ observations from $N(\mu, \sigma_2^2)$. In each case he
recorded only positive observations. Let $y_i$, $i = 1, 2, \ldots, n_1$, be the recorded
observations. (Note $n_1 \le n$.) We can show that the truncated Tobit MLE of $\mu$
and $\sigma^2$, denoted $\hat\mu$ and $\hat\sigma^2$, are defined by equating the first two population
moments of $y_i$ to their respective sample moments:
$$\hat\mu + \hat\sigma\lambda(\hat\mu/\hat\sigma) = n_1^{-1}\sum_{i=1}^{n_1} y_i \tag{10.5.1}$$
and
$$\hat\mu^2 + \hat\sigma\hat\mu\lambda(\hat\mu/\hat\sigma) + \hat\sigma^2 = n_1^{-1}\sum_{i=1}^{n_1} y_i^2. \tag{10.5.2}$$

Taking the probability limit of both sides of (10.5.1) and (10.5.2) and
expressing plim $n_1^{-1}\sum y_i$ and plim $n_1^{-1}\sum y_i^2$ as certain functions of the parameters
$\mu$, $\sigma_1^2$, $\sigma_2^2$, and $r$, we can define plim $\hat\mu$ and plim $\hat\sigma^2$ implicitly as functions of
these parameters. Hurd evaluated the probability limits for various values of $\mu$
and $\sigma_1$ after having fixed $r = 0.5$ and $\sigma_2 = 1$. Hurd found large asymptotic
biases in certain cases.

In the case of one independent variable, Hurd generated observations from
$N(\alpha + \beta x_i, \sigma_i^2)$ after having generated $x_i$ and $\log|\sigma_i|$ from bivariate
$N(0, 0, V_1^2, V_2^2, \rho)$. For given values of $\alpha$, $\beta$, $V_1$, $V_2$, and $\rho$, Hurd found the
values of $\alpha$, $\beta$, and $\sigma^2$ that maximize $E \log L$, where $L$ is as given in (10.2.6).
Those values are the probability limits of the MLE of $\alpha$, $\beta$, and $\sigma^2$ under
Hurd's model if the expectation of log L is taken using the same model. Again,
Hurd found extremely large asymptotic biases in certain cases.

Arabmazar and Schmidt (1981) showed that the asymptotic biases of the
censored Tobit MLE in the i.i.d. case are not as large as those obtained by
Hurd.


10.5.2 Serial Correlation 


Robinson (1982a) proved the strong consistency and the asymptotic
normality of the Tobit MLE under very general assumptions about $u_i$ (normality is
presupposed) and obtained its asymptotic variance-covariance matrix, which
is complicated and therefore not reproduced here. His assumptions are
slightly stronger than the stationarity assumption but are weaker than the
assumption that $u_i$ possesses a continuous spectral density (see Section 5.1.3).
His results are especially useful because the full MLE that takes account of
even a simple type of serial correlation seems computationally intractable.
The autocorrelations of $u_i$ need not be estimated to compute the Tobit MLE
but must be estimated to estimate its asymptotic variance-covariance matrix.
The consistent estimator proposed by Robinson (1982b) may be used for that
purpose.


10.5.3 Nonnormality 


Goldberger (1983) considered an i.i.d. truncated sample model in which data
are generated by a certain nonnormal distribution with mean $\mu$ and variance 1
and are recorded only when the value is smaller than a constant $c$. Let $y$
represent the recorded random variable and let $\bar y$ be the sample mean. The
researcher is to estimate $\mu$ by the MLE, assuming that the data are generated
by $N(\mu, 1)$. As in Hurd's i.i.d. model, the MLE $\hat\mu$ is defined by equating the
population mean of $y$ to its sample mean:
$$\hat\mu - \lambda(c - \hat\mu) = \bar y. \tag{10.5.3}$$

Taking the probability limit of both sides of (10.5.3) under the true model and
putting plim $\hat\mu = \mu^*$ yield
$$\mu^* - \lambda(c - \mu^*) = \mu - h(c - \mu), \tag{10.5.4}$$
where $h(c - \mu) = E(\mu - y \mid y < c)$, the expectation being taken using the true
model. Defining $m = \mu^* - \mu$ and $\theta = c - \mu$, we can rewrite (10.5.4) as
$$m = \lambda(\theta - m) - h(\theta). \tag{10.5.5}$$
Goldberger calculated $m$ as a function of $\theta$ when the data are generated by
Student's $t$ with various degrees of freedom, Laplace, and logistic
distributions. The asymptotic bias was found to be especially great when the true
distribution was Laplace. Goldberger also extended the analysis to the
regression model with a constant term and one discrete independent variable.
Arabmazar and Schmidt (1982) extended Goldberger's analysis to the case of
an unknown variance and found that the asymptotic bias was further
accentuated.
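Equation (10.5.5) defines the asymptotic bias $m$ only implicitly, but it is easy to solve numerically because $m - \lambda(\theta - m)$ is increasing in $m$. The sketch below is an illustration rather than a reproduction of Goldberger's calculations: it solves (10.5.5) by bisection for a Laplace distribution scaled to unit variance, with $h(\theta)$ derived here by direct integration, and uses the normal case (where $h = \lambda$ and hence $m = 0$) as a check. The bracket $[-20, 20]$ is an assumption suited to moderate values of $\theta$.

```python
import math

def mills(z):
    """lambda(z) = phi(z) / Phi(z)."""
    density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return density / (0.5 * math.erfc(-z / math.sqrt(2.0)))

def h_normal(theta):
    """E(mu - y | y < c) when y is N(mu, 1): equal to lambda(theta)."""
    return mills(theta)

def h_laplace(theta):
    """E(mu - y | y < c) when y - mu is Laplace with mean 0 and variance 1."""
    b = 1.0 / math.sqrt(2.0)          # Laplace scale giving unit variance
    if theta <= 0.0:
        return b - theta
    tail = math.exp(-theta / b)
    return 0.5 * (theta + b) * tail / (1.0 - 0.5 * tail)

def asymptotic_bias(theta, h, lo=-20.0, hi=20.0):
    """Solve m = lambda(theta - m) - h(theta) for m by bisection, Eq. (10.5.5)."""
    target = h(theta)

    def g(m):
        return m - mills(theta - m) + target   # increasing in m

    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

for theta in (-1.0, 0.0, 1.0):
    print(theta, asymptotic_bias(theta, h_normal), asymptotic_bias(theta, h_laplace))
```

The middle column should be zero up to rounding, since the MLE is consistent when the data really are normal, while the last column gives the bias $m = \mu^* - \mu$ under the Laplace alternative.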


10.5.4 Tests for Normality 


The fact that the Tobit MLE is generally inconsistent when the true
distribution is nonnormal makes it important for a researcher to test whether the data
are generated by a normal distribution. Nelson (1981) devised tests for
normality in the i.i.d. censored sample model and the Tobit model. His tests are
applications of the specification test of Hausman (1978) (see Section 4.5.1).
Nelson's i.i.d. censored model is defined by
$$y_i = \begin{cases} y_i^* & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0, \end{cases} \qquad i = 1, 2, \ldots, n,$$
where $y_i^* \sim N(\mu, \sigma^2)$ under the null hypothesis. Nelson considered the
estimation of $P(y_i^* > 0)$. Its MLE is $\Phi(\hat\mu/\hat\sigma)$, where $\hat\mu$ and $\hat\sigma$ are the MLE of the
respective parameters. A consistent estimator is provided by $n_1/n$, where, as
before, $n_1$ is the number of positive observations of $y_i$. Clearly, $n_1/n$ is a
consistent estimator of $P(y_i^* > 0)$ under any distribution, provided that it is
i.i.d. The difference between the MLE and the consistent estimator is used as a
test statistic in Hausman's test. Nelson derived the asymptotic variances of the
two estimators under normality.

If we interpret what is being estimated by the two estimators as
$\lim_{n\to\infty} n^{-1}\sum_{i=1}^n P(y_i^* > 0)$, Nelson's test can be interpreted as a test of the null
hypothesis against a more general misspecification than just nonnormality. In
fact, Nelson conducted a simulation study to evaluate the power of the test
against a heteroscedastic alternative. The performance of the test was
satisfactory but not especially encouraging.

In the Tobit model Nelson considered the estimation of
$n^{-1}E\underline{X}'\underline{y} = n^{-1}\sum_{i=1}^n x_i[\Phi(x_i'\alpha)x_i'\beta + \sigma\phi(x_i'\alpha)]$. Its MLE is given by the
right-hand side of this equation evaluated at the Tobit MLE, and its consistent
estimator is provided by $n^{-1}\underline{X}'\underline{y}$. Hausman's test based on these two
estimators will work because this consistent estimator is consistent under general
distributional assumptions on $y$. Nelson derived the asymptotic
variance-covariance matrices of the two estimators.

Nelson was ingenious in that he considered certain functions of the original 
parameters for which estimators that are consistent under very general as- 
sumptions can easily be obtained. However, it would be better if a general 
consistent estimator for the original parameters themselves could be found. 
An example is Powell’s least absolute deviations estimator, to be discussed in 
the next subsection. 

Bera, Jarque, and Lee (1982) proposed using Rao’s score test in testing for 
normality in the standard Tobit model where the error term follows the 
two-parameter Pearson family of distributions, which contains normal as a 
special case. 


10.5.5 Powell's Least Absolute Deviations Estimator 


Powell (1981, 1983) proposed the least absolute deviations (LAD) estimator
(see Section 4.6) for censored and truncated regression models, proved its
consistency under general distributions, and derived its asymptotic
distribution. The intuitive appeal for the LAD estimator in a censored regression
model arises from the simple fact that in the i.i.d. sample case the median (of
which the LAD estimator is a generalization) is not affected by censoring
(more strictly, left censoring below the mean), whereas the mean is. In a
censored regression model the LAD estimator is defined as that which
minimizes $\sum_{i=1}^n |y_i - \max(0, x_i'\beta)|$. The motivation for the LAD estimator in a
truncated regression model is less obvious. Powell defined the LAD estimator
in the truncated case as that which minimizes $\sum_{i=1}^n |y_i - \max(2^{-1}y_i, x_i'\beta)|$. In
the censored case the limit distribution of $\sqrt{n}(\hat\beta - \beta)$, where $\hat\beta$ is the LAD
estimator, is normal with zero mean and variance-covariance matrix
$[4f(0)^2 \lim_{n\to\infty} n^{-1}\sum_{i=1}^n \chi(x_i'\beta > 0)x_ix_i']^{-1}$, where $f$ is the density of the error term and $\chi$
is the indicator function taking on unity if $x_i'\beta > 0$ holds and 0 otherwise. In
the truncated case the limit distribution of $\sqrt{n}(\hat\beta - \beta)$ is normal with zero
mean and variance-covariance matrix $2^{-1}A^{-1}BA^{-1}$, where
$$A = \lim_{n\to\infty} n^{-1}\sum_{i=1}^n \chi(x_i'\beta > 0)[f(0) - f(x_i'\beta)]F(x_i'\beta)^{-1}x_ix_i'$$
and
$$B = \lim_{n\to\infty} n^{-1}\sum_{i=1}^n \chi(x_i'\beta > 0)[F(x_i'\beta) - F(0)]F(x_i'\beta)^{-1}x_ix_i',$$

where $F$ is the distribution function of the error term.

Powell's estimator is attractive because it is the only known estimator that is
consistent under general nonnormal distributions. However, its main
drawback is its computational difficulty. Paarsch (1984) conducted a Monte Carlo
study to compare Powell's estimator, the Tobit MLE, and Heckman's
two-step estimator in the standard Tobit model with one exogenous variable under
situations where the error term is distributed as normal, exponential, and
Cauchy. Paarsch found that when the sample size is small (50) and there is
much censoring (50% of the sample), the minimum frequently occurred at the
boundary of a wide region over which a grid search was performed. In large
samples Powell's estimator appears to perform much better than Heckman's
estimator under any of the three distributional assumptions and much better
than the Tobit MLE when the errors are Cauchy.
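The intuition that censoring leaves the median untouched, and the grid search mentioned above, can both be seen in the constant-only case, where the censored LAD criterion is $\sum_i |y_i - \max(0, b)|$. The following sketch uses a small hypothetical sample of latent values; the grid limits and spacing are arbitrary choices made for the illustration.

```python
import statistics

def censored_lad_objective(b, y):
    """Powell's censored LAD criterion with x_i = 1: sum |y_i - max(0, b)|."""
    fitted = max(0.0, b)
    return sum(abs(yi - fitted) for yi in y)

latent = [-1.0, -0.5, 0.2, 0.4, 0.9, 1.3, 2.0]      # hypothetical y*
observed = [max(0.0, v) for v in latent]            # censored at zero

grid = [-2.0 + 0.01 * k for k in range(501)]        # -2.00, -1.99, ..., 3.00
b_hat = min(grid, key=lambda b: censored_lad_objective(b, observed))

print("LAD estimate:", b_hat)
print("median of latent / observed:", statistics.median(latent), statistics.median(observed))
print("mean of latent / observed:", statistics.mean(latent), statistics.mean(observed))
```

In this sample the grid minimizer sits at the common median of the latent and censored data, whereas the mean of the censored data is pulled upward. The criterion is flat for every $b \le 0$, which hints at why a minimum can end up on the boundary of the search region when censoring is heavy.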

Another problem with Powell's estimator is finding a good estimator of the 
asymptotic variance-covariance matrix that does not require the knowledge 
of the true distribution of the error. Powell (1983) proposed a consistent 
estimator. 

Powell observed that his proof of the consistency and asymptotic normality 
of the LAD estimator generally holds even if the errors are heteroscedastic. 
This fact makes Powell's estimator even more attractive because the usual 
estimators are inconsistent under heteroscedastic errors, as noted earlier. 

Another obvious way to handle nonnormality is to specify a nonnormal
distribution for the $u_i$ in (10.2.3) and use the MLE. See Amemiya and Boskin
(1974), who used a lognormal distribution with upper truncation to analyze
the duration of welfare dependency.


10.6 Generalized Tobit Models 


As stated in Section 10.1, we can classify Tobit models into five common types 
according to similarities in the likelihood function. Type 1 is the standard 
Tobit model, which we have discussed in the preceding sections. In the follow- 
ing sections we shall define and discuss the remaining four types of Tobit 
models. 

It is useful to characterize the likelihood function of each type of model
schematically as in Table 10.2, where each $y_j$, $j = 1, 2,$ and $3$, is assumed to be
distributed as $N(x_j'\beta_j, \sigma_j^2)$, and $P$ denotes a probability or a density or a
combination thereof. We are to take the product of each $P$ over the
observations that belong to a particular category determined by the sign of $y_1$. Thus, in
Type 1 (standard Tobit model), $P(y_1 < 0) \cdot P(y_1)$ is an abbreviated notation
for $\prod_0 P(y_{1i}^* < 0) \cdot \prod_1 f_{1i}(y_{1i})$, where $f_{1i}$ is the density of $N(x_{1i}'\beta_1, \sigma_1^2)$. This
expression can be rewritten as (10.2.5) after dropping the unnecessary
subscript 1.

Table 10.2 Likelihood functions of the five types of Tobit models

Type    Likelihood function
1       $P(y_1 < 0) \cdot P(y_1)$
2       $P(y_1 < 0) \cdot P(y_1 > 0, y_2)$
3       $P(y_1 < 0) \cdot P(y_1, y_2)$
4       $P(y_1 < 0, y_3) \cdot P(y_1, y_2)$
5       $P(y_1 < 0, y_3) \cdot P(y_1 > 0, y_2)$

Another way to characterize the five types is by the classification of the three
dependent variables that appear in Table 10.3. In each type of model, the sign
of $y_1$ determines one of the two possible categories for the observations, and a
censored variable is observed in one category and unobserved in the other.
Note that when $y_1$ is labeled C, it plays two roles: the role of the variable the
sign of which determines categories and the role of a censored variable.

We allow for the possibility that there are constraints among the parameters
of the model $(\beta_j, \sigma_j^2)$, $j = 1, 2,$ or $3$. For example, constraints will occur if the
original model is specified as a simultaneous equations model in terms of
$y_1$, $y_2$, and $y_3$. Then the $\beta$'s denote the reduced-form parameters.

We shall not discuss models in which there is more than one binary variable
and, hence, models the likelihood function of which consists of more than two
components. Such models are computationally more burdensome because
they involve double or higher-order integration of joint normal densities. The
only exception occurs in Section 10.10.6, which includes models that are
obvious generalizations of the Type 5 Tobit model. Neither shall we discuss a
simultaneous equations Tobit model of Amemiya (1974b). The simplest
two-equation case of this model is defined by $y_{1i} = \max(\gamma_1y_{2i} + x_{1i}'\beta_1 + u_{1i}, 0)$ and
$y_{2i} = \max(\gamma_2y_{1i} + x_{2i}'\beta_2 + u_{2i}, 0)$, where $(u_{1i}, u_{2i})$ are bivariate normal and
$\gamma_1\gamma_2 < 1$ must be assumed for the model to be logically consistent.¹⁰ A
schematic representation of the likelihood function of this two-equation model is
$$P(y_1, y_2) \cdot P(y_1 < 0, y_3) \cdot P(y_2 < 0, y_4) \cdot P(y_3 < 0, y_4 < 0)$$
with $y$'s appropriately defined.

Table 10.3 Characterization of the five types of Tobit models

        Dependent variables
Type    $y_1$   $y_2$   $y_3$
1       C       -       -
2       B       C       -
3       C       C       -
4       C       C       C
5       B       C       C

Note: C = censored; B = binary.


10.7 Type 2 Tobit Model: $P(y_1 < 0) \cdot P(y_1 > 0, y_2)$


10.7.1 Definition and Estimation


The Type 2 Tobit model is defined as follows:
$$\begin{aligned} y_{1i}^* &= x_{1i}'\beta_1 + u_{1i} \\ y_{2i}^* &= x_{2i}'\beta_2 + u_{2i} \\ y_{2i} &= y_{2i}^* \quad \text{if } y_{1i}^* > 0 \\ &= 0 \quad\ \ \text{if } y_{1i}^* \le 0, \qquad i = 1, 2, \ldots, n, \end{aligned} \tag{10.7.1}$$
where $\{u_{1i}, u_{2i}\}$ are i.i.d. drawings from a bivariate normal distribution with
zero mean, variances $\sigma_1^2$ and $\sigma_2^2$, and covariance $\sigma_{12}$. It is assumed that only
the sign of $y_{1i}^*$ is observed and that $y_{2i}^*$ is observed only when $y_{1i}^* > 0$. It is
assumed that $x_{1i}$ are observed for all $i$ but that $x_{2i}$ need not be observed for $i$
such that $y_{1i}^* \le 0$. We may also define, as in (10.4.3),
$$w_{1i} = \begin{cases} 1 & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0. \end{cases} \tag{10.7.2}$$

Then $\{w_{1i}, y_{2i}\}$ constitute the observed sample of the model. It should be
noted that, unlike the Type 1 Tobit, $y_{2i}$ may take negative values.¹¹ As in
(10.2.4), $y_{2i} = 0$ merely signifies the event $y_{1i}^* \le 0$.

The likelihood function of the model is given by
$$L = \prod_0 P(y_{1i}^* \le 0) \prod_1 f(y_{2i} \mid y_{1i}^* > 0)P(y_{1i}^* > 0), \tag{10.7.3}$$
where $\prod_0$ and $\prod_1$ stand for the product over those $i$ for which $y_{2i} = 0$ and
$y_{2i} \ne 0$, respectively, and $f(\,\cdot \mid y_{1i}^* > 0)$ stands for the conditional density of $y_{2i}^*$
given $y_{1i}^* > 0$. Note the similarity between (10.4.1) and (10.7.3). As in Type 1
Tobit, we can obtain a consistent estimate of $\beta_1/\sigma_1$ by maximizing the probit
part of (10.7.3).
$$\text{Probit } L = \prod_0 P(y_{1i}^* \le 0) \prod_1 P(y_{1i}^* > 0). \tag{10.7.4}$$

Also, (10.7.4) is a part of the likelihood function for every one of the five types
of models; therefore a consistent estimate of $\beta_1/\sigma_1$ can be obtained by the
probit MLE in each of these types of models.

We can rewrite (10.7.3) as 


L= [Pot £0) II f fti ya) dyt, (10.7.5) 


where f( * , - ) denotes the joint density of yf; and y3,. We can write the joint 
density as the product of a conditional density and a marginal density, that is 
fts ya) — L(Y TM Y21) (2), and can determine a specific form for f(y fi] yz) 
from the well-known fact that the conditional distribution of y¥, given 
y$ = yg is normal with mean xj;f, + 0,05? (y;; — x;B;) and variance 
a? —01,03?. Thus we can further rewrite (10.7.5) as 


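$$
\begin{aligned}
L = {} & \prod_0 \left[1 - \Phi(x_{1i}'\beta_1\sigma_1^{-1})\right] \\
& \times \prod_1 \Phi\left\{\left[x_{1i}'\beta_1\sigma_1^{-1} + \sigma_{12}\sigma_1^{-1}\sigma_2^{-2}(y_{2i} - x_{2i}'\beta_2)\right]\left[1 - \sigma_{12}^2\sigma_1^{-2}\sigma_2^{-2}\right]^{-1/2}\right\} \\
& \times \sigma_2^{-1}\phi\left[\sigma_2^{-1}(y_{2i} - x_{2i}'\beta_2)\right].
\end{aligned}
\tag{10.7.6}
$$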


Note that $L$ depends on $\sigma_1$ only through $\beta_1\sigma_1^{-1}$ and $\sigma_{12}\sigma_1^{-1}$; therefore, if there is
no constraint on the parameters, we can put $\sigma_1 = 1$ without any loss of generality.
Then the remaining parameters can be identified. If, however, there is at
least one common element in $\beta_1$ and $\beta_2$, $\sigma_1$ can also be identified.
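
The likelihood (10.7.6) is simple to evaluate numerically. The following Python sketch is an illustration only and is not part of the original exposition: the function name `type2_loglik`, the data layout, and the use of the standard library's `math.erfc` to compute the standard normal distribution function are all assumptions made for the example.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def type2_loglik(data, beta1, beta2, sigma2, sigma12, sigma1=1.0):
    """Log of the Type 2 Tobit likelihood (10.7.6).

    data: list of (x1, x2, w, y2), where x1 and x2 are lists of regressors,
    w = 1 if y1* > 0 and 0 otherwise, and y2 is ignored when w == 0.
    """
    rho2 = sigma12 ** 2 / (sigma1 ** 2 * sigma2 ** 2)
    ll = 0.0
    for x1, x2, w, y2 in data:
        a = sum(b * x for b, x in zip(beta1, x1)) / sigma1   # x1'beta1 / sigma1
        if w == 0:
            ll += math.log(norm_cdf(-a))                     # log[1 - Phi(a)]
        else:
            r = y2 - sum(b * x for b, x in zip(beta2, x2))   # y2 - x2'beta2
            arg = (a + sigma12 * r / (sigma1 * sigma2 ** 2)) / math.sqrt(1.0 - rho2)
            ll += (math.log(norm_cdf(arg)) - math.log(sigma2)
                   + math.log(norm_pdf(r / sigma2)))
    return ll
```

Multiplying `beta1`, `sigma12`, and `sigma1` by a common positive constant leaves the value returned by this sketch unchanged, which is the numerical counterpart of the remark that $\sigma_1$ enters (10.7.6) only through $\beta_1\sigma_1^{-1}$ and $\sigma_{12}\sigma_1^{-1}$.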

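We shall show how Heckman's two-step estimator can be used in this
model. To obtain an equation comparable to (10.4.11), we need to evaluate
$E(y_{2i}^* \mid y_{1i}^* > 0)$. For this purpose we use

$$
y_{2i}^* = x_{2i}'\beta_2 + \sigma_{12}\sigma_1^{-2}(y_{1i}^* - x_{1i}'\beta_1) + \zeta_{2i}, \tag{10.7.7}
$$

where $\zeta_{2i}$ is normally distributed independently of $y_{1i}^*$ with zero mean and
variance $\sigma_2^2 - \sigma_{12}^2\sigma_1^{-2}$. Using (10.7.7), we can express $E(y_{2i}^* \mid y_{1i}^* > 0)$ as a simple
linear function of $E(y_{1i}^* \mid y_{1i}^* > 0)$, which was already obtained in Section
10.4. Using (10.7.7), we can also derive $V(y_{2i}^* \mid y_{1i}^* > 0)$ easily. Thus we obtain

$$
y_{2i} = x_{2i}'\beta_2 + \sigma_{12}\sigma_1^{-1}\lambda(x_{1i}'\alpha_1) + \epsilon_{2i} \quad \text{for } i \text{ such that } y_{2i} \ne 0, \tag{10.7.8}
$$

where $\alpha_1 = \beta_1\sigma_1^{-1}$, $E\epsilon_{2i} = 0$, and

$$
V\epsilon_{2i} = \sigma_2^2 - \sigma_{12}^2\sigma_1^{-2}\left[x_{1i}'\alpha_1\lambda(x_{1i}'\alpha_1) + \lambda(x_{1i}'\alpha_1)^2\right]. \tag{10.7.9}
$$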


As in the case of the Type 1 Tobit, Heckman's two-step estimator is the LS
estimator applied to (10.7.8) after replacing $\alpha_1$ with the probit MLE. The
asymptotic distribution of the estimator can be obtained in a manner similar
to that in Section 10.4.3 by defining $\eta_{2i}$ in the same way as before. It was first
derived by Heckman (1979).
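
The two steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Heckman's own implementation: the helper names (`probit_mle`, `heckman_two_step`, `simulate`), the Newton-Raphson probit routine, the small Gaussian-elimination solver, and the simulated design with $\sigma_1 = 1$ are all hypothetical choices made for the example, and $\lambda(z)$ is taken to be $\phi(z)/\Phi(z)$.

```python
import math
import random

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def lam(z):
    # lambda(z) = phi(z) / Phi(z), the regressor added in (10.7.8)
    return phi(z) / Phi(z)

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def probit_mle(X, w, iters=25):
    # Newton-Raphson on the probit log likelihood; estimates alpha_1 = beta_1 / sigma_1.
    k = len(X[0])
    a = [0.0] * k
    for _ in range(iters):
        g = [0.0] * k
        H = [[0.0] * k for _ in range(k)]
        for x, wi in zip(X, w):
            z = sum(ai * xi for ai, xi in zip(a, x))
            s = lam(z) if wi else -lam(-z)   # first derivative w.r.t. the index z
            h = s * (s + z)                  # minus the second derivative w.r.t. z
            for i in range(k):
                g[i] += s * x[i]
                for j in range(k):
                    H[i][j] += h * x[i] * x[j]
        a = [ai + di for ai, di in zip(a, solve(H, g))]
    return a

def heckman_two_step(X1, X2, w, y2):
    a_hat = probit_mle(X1, w)                      # step 1: probit MLE of alpha_1
    rows, ys = [], []
    for x1, x2, wi, yi in zip(X1, X2, w, y2):
        if wi:                                     # keep only i with y1* > 0
            z = sum(a * x for a, x in zip(a_hat, x1))
            rows.append(list(x2) + [lam(z)])
            ys.append(yi)
    coef = ols(rows, ys)                           # step 2: LS applied to (10.7.8)
    return a_hat, coef[:-1], coef[-1]              # alpha_1, beta_2, sigma_12 / sigma_1

def simulate(n, beta1, beta2, sigma2, sigma12, seed=0):
    # Hypothetical design: sigma_1 = 1, an intercept and one regressor per equation.
    rng = random.Random(seed)
    X1, X2, w, y2 = [], [], [], []
    for _ in range(n):
        x1 = [1.0, rng.gauss(0.0, 1.0)]
        x2 = [1.0, rng.gauss(0.0, 1.0)]
        u1 = rng.gauss(0.0, 1.0)
        u2 = sigma12 * u1 + math.sqrt(sigma2 ** 2 - sigma12 ** 2) * rng.gauss(0.0, 1.0)
        y1s = beta1[0] * x1[0] + beta1[1] * x1[1] + u1
        y2s = beta2[0] * x2[0] + beta2[1] * x2[1] + u2
        X1.append(x1)
        X2.append(x2)
        w.append(1 if y1s > 0 else 0)
        y2.append(y2s if y1s > 0 else 0.0)
    return X1, X2, w, y2
```

Calling `heckman_two_step(*simulate(4000, [0.3, 1.0], [1.0, 0.5], 1.0, 0.5))` returns estimates of $\alpha_1$, $\beta_2$, and $\sigma_{12}\sigma_1^{-1}$. The sketch reports point estimates only; it does not compute the asymptotic variance-covariance matrix discussed in the text.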

The standard Tobit (Type 1) is a special case of Type 2 in which $y_{1i}^* = y_{2i}^*$.
Therefore (10.7.8) and (10.7.9) will be reduced to (10.4.11) and (10.4.12) by
putting $x_{1i}'\beta_1 = x_{2i}'\beta_2$ and $\sigma_1^2 = \sigma_2^2 = \sigma_{12}$.

A generalization of the two-step method applied to (10.4.23) can easily be 
defined for this model but will not be discussed. 

Note that the consistency of Heckman's estimator does not require the joint
normality of $y_1^*$ and $y_2^*$ provided that $y_1^*$ is normal and that Eq. (10.7.7) holds
with $\zeta_2$ distributed independently of $y_1^*$ but not necessarily normal (Olsen,
1980). For then (10.7.8) would still be valid. As pointed out by Lee (1982c),
the asymptotic variance-covariance matrix of Heckman's estimator can be
consistently estimated under these less restrictive assumptions by using
White's estimator analogous to the one mentioned after Eq. (10.4.22). Note
that White's estimator does not require (10.7.9) to be valid.


10.7.2 A Special Case of Independence 


Dudley and Montmarquette (1976) analyzed whether or not the United States
gives foreign aid to a particular country and, if it does, how much foreign aid it
gives using a special case of the model (10.7.1), where the independence of $u_{1i}$
and $u_{2i}$ is assumed. In their model the sign of $y_{1i}^*$ determines whether aid is
given to the $i$th country, and $y_{2i}^*$ determines the actual amount of aid. They
used the probit MLE to estimate $\beta_1$ (assuming $\sigma_1 = 1$) and the least squares
regression of $y_{2i}$ on $x_{2i}$ to estimate $\beta_2$. The LS estimator of $\beta_2$ is consistent in
their model because of the assumed independence between $u_{1i}$ and $u_{2i}$. This
makes their model computationally advantageous. However, it seems unrealistic
to assume that the potential amount of aid, $y_2^*$, is independent of the
variable that determines whether or not aid is given, $y_1^*$. Within the whole
spectrum of models contained in Type 2 (with the correlation between $y_1^*$ and
$y_2^*$ varying from $-1$ to $+1$), this model is the opposite extreme of the Tobit
model, which can be regarded as the special case of Type 2 in which there is
total dependence between $y_1^*$ and $y_2^*$.

Because of the computational advantage mentioned earlier, this "independence"
model and its variations were frequently used in econometric applications
in the 1960s and early 1970s. In many of these studies, authors made the
additional linear probability assumption $P(y_{1i}^* > 0) = x_{1i}'\beta_1$, which enabled
them to estimate $\beta_1$ (as well as $\beta_2$) consistently by the least squares method. For
examples of these studies, see the articles by Huang (1964) and Wu (1965).


10.7.3 Gronau's Model 


Gronau (1973) assumed that the offered wage $W^o$ is given to each housewife
independently of hours worked ($H$), rather than as a schedule $W^o(H)$. Given
$W^o$, a housewife maximizes her utility function $U(C, X)$ subject to $X =
W^oH + V$ and $C + H = T$, where $C$ is time spent at home for childcare, $X$
represents all other goods, $T$ is total available time, and $V$ is other income.
Thus a housewife does not work if

$$
\frac{\partial U}{\partial C}\left(\frac{\partial U}{\partial X}\right)^{-1}\Bigg|_{H=0} > W^o \tag{10.7.10}
$$

and works if the inequality in (10.7.10) is reversed. If she works, the hours of
work $H$ and the actual wage rate $W$ must be such that

$$
\frac{\partial U}{\partial C}\left(\frac{\partial U}{\partial X}\right)^{-1} = W.
$$

Gronau called the left-hand side of (10.7.10) the housewife's value of time or,
more commonly, the reservation wage, denoted $W^r$.¹²
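Assuming that both $W^o$ and $W^r$ can be written as linear combinations of
independent variables plus error terms, his model may be statistically described
as follows:

$$
\begin{aligned}
W_i^o &= x_{2i}'\beta_2 + u_{2i} \\
W_i^r &= z_i'\alpha + v_i \\
W_i &= W_i^o \quad \text{if } W_i^o > W_i^r \\
&= 0 \quad \text{if } W_i^o \le W_i^r, \qquad i = 1, 2, \ldots, n,
\end{aligned}
\tag{10.7.11}
$$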


where $(u_{2i}, v_i)$ are i.i.d. drawings from a bivariate normal distribution with
zero mean, variances $\sigma_u^2$ and $\sigma_v^2$, and covariance $\sigma_{uv}$. Thus the model can be
written in the form of (10.7.1) by putting $W_i^o - W_i^r = y_{1i}^*$ and $W_i^o = y_{2i}^*$. Note
that $H$ (hours worked) is not explained by this statistical model although it is
determined by Gronau's theoretical model. A statistical model explaining $H$
as well as $W$ was developed by Heckman (1974) and will be discussed in
Section 10.8.2.

Because the model (10.7.11) can be transformed into the form (10.7.1) in
such a way that the parameters of (10.7.11) can be determined from the
parameters of (10.7.1), all the parameters of the model are identifiable except
$V(W_i^o - W_i^r)$, which can be set equal to 1 without loss of generality. If, however,
at least one element of $x_{2i}$ is not included in $z_i$, all the parameters are
identifiable. They can be estimated by the MLE or Heckman's two-step
estimator by procedures described in Section 10.7.1. We can also use the
probit MLE (the first step of Heckman's two-step) to estimate a certain subset
of the parameters.¹⁴
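
The transformation from (10.7.11) to (10.7.1) can be written out explicitly. The sketch below is an illustration rather than part of the original text: the function names are hypothetical, and `var_u`, `var_v`, and `cov_uv` stand for the variances and covariance of the two error terms of (10.7.11) (the offered-wage error and the reservation-wage error). It simply applies the rules for the variance and covariance of a difference.

```python
def gronau_to_type2(var_u, var_v, cov_uv):
    """Second moments of the Type 2 errors implied by putting
    y1* = W^o - W^r and y2* = W^o in Gronau's model (10.7.11)."""
    var1 = var_u + var_v - 2.0 * cov_uv   # V(W^o - W^r), the variance of u1
    var2 = var_u                          # variance of u2
    cov12 = var_u - cov_uv                # Cov(u1, u2) = Cov(u2 - v, u2)
    return var1, var2, cov12

def gronau_index(x2, beta2, z, alpha):
    """Mean of y1*: x2'beta2 - z'alpha."""
    return (sum(b * x for b, x in zip(beta2, x2))
            - sum(a * v for a, v in zip(alpha, z)))
```

For instance, `gronau_to_type2(1.0, 2.0, 0.5)` gives `(2.0, 1.0, 0.5)`. Setting $V(W_i^o - W_i^r) = 1$ corresponds to dividing the index and the first error by the square root of the first returned value.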


10.7.4 Other Applications of the Type 2 Tobit Model 


Nelson (1977) noted that a Type 2 Tobit model arises if $y_0$ in (10.2.1) is
assumed to be a random variable with its mean equal to a linear combination
of independent variables. He reestimated Gronau's model by the MLE.

In the study of Westin and Gillen (1978), $y_2^*$ represents the parking cost with
$x_2$ including zonal dummies, wage rate (as a proxy for value of walking time),
and the square of wage rate. A researcher observes $y_2^* = y_2$ if $y_2^* < C$, where $C$
represents transit cost, which itself is a function of independent variables plus
an error term.


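10.8 Type 3 Tobit Model: $P(y_1 < 0) \cdot P(y_1, y_2)$

10.8.1 Definition and Estimation

The Type 3 Tobit model is defined as follows:

$$
\begin{aligned}
y_{1i}^* &= x_{1i}'\beta_1 + u_{1i} \\
y_{2i}^* &= x_{2i}'\beta_2 + u_{2i} \\
y_{1i} &= y_{1i}^* \quad \text{if } y_{1i}^* > 0 \\
&= 0 \quad \text{if } y_{1i}^* \le 0 \\
y_{2i} &= y_{2i}^* \quad \text{if } y_{1i}^* > 0 \\
&= 0 \quad \text{if } y_{1i}^* \le 0, \qquad i = 1, 2, \ldots, n,
\end{aligned}
\tag{10.8.1}
$$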


where $\{u_{1i}, u_{2i}\}$ are i.i.d. drawings from a bivariate normal distribution with
zero mean, variances $\sigma_1^2$ and $\sigma_2^2$, and covariance $\sigma_{12}$. Note that this model
differs from Type 2 only in that in this model $y_{1i}^*$ is also observed when it is
positive.

Because the estimation of this model can be handled in a manner similar to 
the handling of Type 2, we shall discuss it only briefly. Instead, in the following 
we shall give a detailed discussion of the estimation of Heckman's model 
(1974), which constitutes the structural equations version of the model 
(10.8.1). 

The likelihood function of the model (10.8.1) can be written as 


$$
L = \prod_0 P(y_{1i}^* \le 0) \prod_1 f(y_{1i}, y_{2i}), \tag{10.8.2}
$$

where $f(\,\cdot\,, \cdot\,)$ is the joint density of $y_{1i}^*$ and $y_{2i}^*$. Because $y_{1i}^*$ is observed when it
is positive, all the parameters of the model are identifiable, including $\sigma_1^2$.

Heckman's two-step estimator was originally proposed by Heckman
(1976a) for this model. Here we shall obtain two conditional expectation
equations, (10.4.11) and (10.7.8), for $y_1$ and $y_2$, respectively. [Add subscript 1
to all the variables and the parameters in (10.4.11) to conform to the notation
of this section.] In the first step of the method, $\alpha_1 = \beta_1\sigma_1^{-1}$ is estimated by the
probit MLE $\hat{\alpha}_1$. In the second step, least squares is applied separately to
(10.4.11) and (10.7.8) after replacing $\alpha_1$ by $\hat{\alpha}_1$. The asymptotic variance-covariance
matrix of the resulting estimates of $(\beta_1, \sigma_1)$ is given in (10.4.22) and
that for $(\beta_2, \sigma_{12}\sigma_1^{-1})$ can be similarly obtained. The latter is given by Heckman
(1979). A consistent estimate of $\sigma_2$ can be obtained using the residuals of
Eq. (10.7.8). As Heckman (1976a) suggested and as was noted in Section
10.4.3, a more efficient WLS can be used for each equation in the second step
of the method. An even more efficient GLS can be applied simultaneously to
the two equations. However, even GLS is not fully efficient compared to
MLE, and the added computational burden of MLE may be sufficiently
compensated for by the gain in efficiency. A two-step method based on
unconditional means of $y_1$ and $y_2$, which is a generalization of the method
discussed in Section 10.4.3, can also be used for this model.

Wales and Woodland (1980) compared the LS estimator, Heckman's two-step
estimator, probit MLE, conditional MLE (using only those who worked),
MLE, and another inconsistent estimator in a Type 3 Tobit model in a
simulation study with one replication (sample size 1000 and 5000). The
particular model they used is the labor supply model of Heckman (1974),
which will be discussed in the next subsection.¹⁵ The LS estimator was found
to be poor, and all three ML estimators were found to perform well. Heckman's
two-step estimator was ranked somewhere between LS and MLE.


10.8.2 Heckman’s Model 


Heckman's model (Heckman, 1974) differs from Gronau's model (10.7.11) in
that Heckman included the determination of hours worked ($H$) in his
model.¹⁶ Like Gronau, Heckman assumes that the offered wage $W^o$ is given
independently of $H$; therefore Heckman's $W^o$ equation is the same as
Gronau's:

$$
W_i^o = x_{2i}'\beta_2 + u_{2i}. \tag{10.8.3}
$$

Heckman defined $W_i^r = (\partial U/\partial C)/(\partial U/\partial X)$ and specified¹⁷

$$
W_i^r = \gamma H_i + z_i'\alpha + v_i. \tag{10.8.4}
$$

It is assumed that the $i$th individual works if

$$
W_i^r(H_i = 0) \equiv z_i'\alpha + v_i < W_i^o \tag{10.8.5}
$$

and then the wage $W_i$ and hours worked $H_i$ are determined by solving (10.8.3)
and (10.8.4) simultaneously after putting $W_i^o = W_i^r = W_i$. Thus we can define
Heckman's model as

$$
W_i = x_{2i}'\beta_2 + u_{2i} \tag{10.8.6}
$$

and

$$
W_i = \gamma H_i + z_i'\alpha + v_i \tag{10.8.7}
$$

for those $i$ for which desired hours of work

$$
H_i^* \equiv x_{1i}'\beta_1 + u_{1i} > 0, \tag{10.8.8}
$$

where $x_{1i}'\beta_1 = \gamma^{-1}(x_{2i}'\beta_2 - z_i'\alpha)$ and $u_{1i} = \gamma^{-1}(u_{2i} - v_i)$. Note that (10.8.5) and
(10.8.8) are equivalent because $\gamma > 0$.
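
The equivalence of the participation rule (10.8.5) and the desired-hours rule (10.8.8) can be checked numerically. The function below is a hypothetical sketch, not Heckman's procedure: its name and its scalar arguments (the systematic parts and errors of the two wage equations) are assumptions made for illustration, and it reports zero wage and hours for nonworkers in line with the Type 3 observation rule.

```python
def heckman_outcome(x2b2, u2, za, v, gamma):
    """Solve (10.8.3)-(10.8.4) for one individual.

    x2b2 + u2 is the offered wage; gamma * H + za + v is the reservation wage.
    Returns (works, W, H).
    """
    if gamma <= 0:
        raise ValueError("the model assumes gamma > 0")
    offered = x2b2 + u2                                  # W^o, (10.8.3)
    reservation_at_zero = za + v                         # W^r at H = 0, (10.8.4)
    h_star = (offered - reservation_at_zero) / gamma     # desired hours, (10.8.8)
    works = reservation_at_zero < offered                # participation rule (10.8.5)
    assert works == (h_star > 0)                         # the two rules agree when gamma > 0
    if works:
        return True, offered, h_star                     # W and H solve (10.8.6)-(10.8.7)
    return False, 0.0, 0.0
```

With `heckman_outcome(2.0, 0.1, 1.0, 0.3, 0.5)` the individual works, and the returned wage and hours satisfy both structural equations (10.8.6) and (10.8.7) up to rounding.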

Call (10.8.6) and (10.8.7) the structural equations; then (10.8.6) and the
identity part of (10.8.8) constitute the reduced-form equations. The reduced-form
equations of Heckman's model can be shown to correspond to the Type
3 Tobit model (10.8.1) if we put $H^* = y_1^*$, $H = y_1$, $W^o = y_2^*$, and $W = y_2$.

We have already discussed the estimation of the reduced-form parameters
in the context of the model (10.8.1), but we have not discussed the estimation
of the structural parameters. Heckman (1974) estimated the structural parameters
by MLE. In the next two subsections we shall discuss three alternative
methods of estimating the structural parameters.


10.8.3 Two-Step Estimator of Heckman 


Heckman (1976a) proposed the two-step estimator of the reduced-form parameters
(which we discussed in Section 10.8.1); but he also reestimated the
labor supply model of Heckman (1974) using the structural equations version.
Because (10.8.6) is a reduced-form as well as a structural equation, the estimation
of $\beta_2$ is done in the same way as discussed in Section 10.8.1, namely, by
applying least squares to the regression equation for $E(W_i \mid H_i^* > 0)$ after
estimating the argument of $\lambda$ (the hazard rate) by probit MLE. So we shall discuss
only the estimation of (10.8.7), which we can rewrite as

$$
H_i = \gamma^{-1}W_i - z_i'\alpha\gamma^{-1} - \gamma^{-1}v_i. \tag{10.8.9}
$$

By subtracting $E(v_i \mid H_i^* > 0)$ from $v_i$ and adding the same, we can rewrite
(10.8.9) further as

$$
H_i = \gamma^{-1}W_i - z_i'\alpha\gamma^{-1} - \sigma_{1v}\sigma_1^{-1}\gamma^{-1}\lambda(x_{1i}'\beta_1/\sigma_1) - \gamma^{-1}\epsilon_i, \tag{10.8.10}
$$

where $\sigma_{1v} = \text{Cov}(u_{1i}, v_i)$, $\sigma_1^2 = Vu_{1i}$, and $\epsilon_i = v_i - E(v_i \mid H_i^* > 0)$. Then the
consistent estimates of $\gamma^{-1}$, $\alpha\gamma^{-1}$, and $\sigma_{1v}\sigma_1^{-1}\gamma^{-1}$ are obtained by the least
squares regression applied to (10.8.10) after replacing $\beta_1/\sigma_1$ by its probit MLE
and $W_i$ by $\hat{W}_i$, the least squares predictor of $W_i$ obtained by applying Heckman's
two-step estimator to (10.8.6). The asymptotic variance-covariance
matrix of this estimator can be deduced from the results in the article by
Heckman (1978), who considered the estimation of a more general model
(which we shall discuss in the section on Type 5 Tobit models).

Actually, there is no apparent reason why we must first solve (10.8.7) for $H_i$
and proceed as indicated earlier. Heckman could just as easily have subtracted
and added $E(v_i \mid H_i^* > 0)$ to (10.8.7) itself and proceeded similarly. This
method would yield alternative consistent estimates. Inferring from the
well-known fact that the two-stage least squares estimates of the standard
simultaneous equations model yield asymptotically equivalent estimates regardless of
which normalization is chosen, the Heckman two-step method applied to
(10.8.7) and (10.8.9) should also yield asymptotically equivalent estimates of $\gamma$
and $\alpha$.

Lee, Maddala, and Trost (1980) extended Heckman's simultaneous equations
two-step estimator and its WLS version (taking account of the
heteroscedasticity) to more general simultaneous equations Tobit models and
obtained their asymptotic variance-covariance matrices.


10.8.4 Amemiya’s Least Squares and Generalized Least 
Squares Estimators 


Amemiya (1978c, 1979) proposed a general method of obtaining the esti- 
mates of the structural parameters from given reduced-form parameter esti- 
mates in general Tobit-type models and derived the asymptotic distribution. 
Suppose that a structural equation and the corresponding reduced-form 
equations are given by 


$$
\begin{aligned}
y &= Y\gamma + X_1\beta + u \\
[y, Y] &= X[\pi, \Pi] + V,
\end{aligned}
\tag{10.8.11}
$$

where $X_1$ is a subset of $X$. Then the structural parameters $\gamma$ and $\beta$ are related to
the reduced-form parameters $\pi$ and $\Pi$ in the following way:

$$
\pi = \Pi\gamma + J\beta, \tag{10.8.12}
$$

where $J$ is a known matrix consisting of only ones and zeros. If, for example,
$X_1$ constitutes the first $K_1$ columns of $X$ ($K = K_1 + K_2$), then we have
$J = (I, 0)'$, where $I$ is the identity matrix of size $K_1$ and $0$ is the $K_1 \times K_2$ matrix
of zeros. It is assumed that $\pi$, $\gamma$, and $\beta$ are vectors and $\Pi$ and $J$ are matrices of
conformable sizes. Equation (10.8.12) holds for Heckman's model and more
general simultaneous equations Tobit models, as well as for the standard
simultaneous equations model (see Section 7.3.6).

Now suppose certain estimates $\hat{\pi}$ and $\hat{\Pi}$ of the reduced-form parameters are
given. Then, using them, we can rewrite (10.8.12) as

$$
\hat{\pi} = \hat{\Pi}\gamma + J\beta + (\hat{\pi} - \pi) - (\hat{\Pi} - \Pi)\gamma. \tag{10.8.13}
$$
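
Equation (10.8.13) has the form of a regression of the estimated reduced-form vector on the estimated reduced-form matrix and $J$, with the last two terms playing the role of the error. A least squares fit of that regression can be sketched as follows. The function names and the list-of-rows data layout are assumptions made for this illustration; the GLS version, which requires the covariance matrix of the error term, is not attempted here.

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def structural_ls(pi_hat, Pi_hat, J):
    """Least squares applied to (10.8.13): regress pi_hat on [Pi_hat, J].

    pi_hat: list of K numbers; Pi_hat: K rows, one column per right-hand-side
    endogenous variable; J: K rows of K1 zeros and ones.
    Returns (gamma estimate, beta estimate).
    """
    Z = [list(p) + list(j) for p, j in zip(Pi_hat, J)]
    m = len(Z[0])
    ZtZ = [[sum(r[a] * r[b] for r in Z) for b in range(m)] for a in range(m)]
    Ztp = [sum(r[a] * p for r, p in zip(Z, pi_hat)) for a in range(m)]
    theta = solve(ZtZ, Ztp)
    G = len(Pi_hat[0])
    return theta[:G], theta[G:]
```

When the reduced-form estimates satisfy (10.8.12) exactly, the regression has zero residuals and the sketch returns the structural parameters themselves; with estimated reduced-form parameters it returns the least squares estimates.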


Amemiya proposed applying LS and GLS estimation to (10.8.13). From 
Amemiya's result (Amemiya, 1978c), we can infer that Amemiya's GLS 
applied to Heckman's model yields more efficient estimates than Heckman's 
simultaneous equations two-step estimator discussed earlier. Amemiya 
(1983b) showed the superiority of the Amemiya GLS estimator to the WLS 
version of the Lee-Maddala-Trost estimator in a general simultaneous equations
Tobit model.


10.8.5 Other Examples of Type 3 Tobit Models


Roberts, Maddala, and Enholm (1978) estimated two types of simultaneous 
equations Tobit models to explain how utility rates are determined. One of 
their models has a reduced form that is essentially Type 3 Tobit; the other is a 
simple extension of Type 3. 

The structural equations of their first model are 


$$
y_{2i}^* = x_{2i}'\beta_2 + u_{2i} \tag{10.8.14}
$$

and

$$
y_{3i}^* = \gamma y_{2i}^* + x_{3i}'\beta_3 + u_{3i}, \tag{10.8.15}
$$

where $y_{2i}^*$ is the rate requested by the $i$th utility firm, $y_{3i}^*$ is the rate granted for
the $i$th firm, $x_{2i}$ includes the embedded cost of capital and the last rate granted
minus the current rate being earned, and $x_{3i}$ includes only the last variable
mentioned. It is assumed that $y_{2i}^*$ and $y_{3i}^*$ are observed only if

$$
y_{1i}^* \equiv z_i'\alpha + v_i > 0, \tag{10.8.16}
$$

where $z_i$ include the earnings characteristics of the $i$th firm. ($Vv_i$ is assumed to
be unity.) The variable $y_1^*$ may be regarded as an index affecting a firm's decision
as to whether or not it requests a rate increase. The model (10.8.14) and
(10.8.15) can be labeled as $P(y_1 < 0) \cdot P(y_1 > 0, y_2, y_3)$ in our shorthand
notation and therefore is a simple generalization of Type 3. The estimation
method of Roberts, Maddala, and Enholm is that of Lee, Maddala, and Trost
(1980) and can be described as follows:

Step 1. Estimate $\alpha$ by the probit MLE.

Step 2. Estimate $\beta_2$ by Heckman's two-step method.

Step 3. Replace $y_{2i}^*$ in the right-hand side of (10.8.15) by $\hat{y}_{2i}^*$ obtained in step
2 and estimate $\gamma$ and $\beta_3$ by the least squares applied to (10.8.15) after adding
the hazard rate term $E(u_{3i} \mid y_{1i}^* > 0)$.

The second model of Roberts, Maddala, and Enholm is the same as the first
model except that (10.8.16) is replaced by

$$
y_{2i}^* > R_i, \tag{10.8.17}
$$

where $R_i$ refers to the current rate being earned, an independent variable. Thus
this model is essentially Type 3. (It would be exactly Type 3 if $R_i = 0$.) The
estimation method is as follows:

Step 1. Estimate $\beta_2$ by the Tobit MLE.

Step 2. Repeat step 3 described in the preceding paragraph.

Nakamura, Nakamura, and Cullen (1979) estimated essentially the same
model as Heckman (1974) using Canadian data on married women. They
used the WLS version of Heckman's simultaneous equations two-step
estimators, that is, they applied WLS to (10.8.10).

Hausman and Wise (1976, 1977, 1979) used Type 3 and its generalizations
to analyze the labor supply of participants in the negative income tax (NIT)
experiments. Their models are truncated models because they used observations
on only those persons who participated in the experiments. The first
model of Hausman and Wise (1977) is a minor variation of the standard Tobit
model, where earnings $Y$ follow

$$
Y_i = Y_i^* \quad \text{if } Y_i^* < L_i, \qquad Y_i^* \sim N(x_i'\beta, \sigma^2), \tag{10.8.18}
$$

where $L_i$ is a (known) poverty level that qualifies the $i$th person to participate
in the NIT program. It varies systematically with family size. The model is
estimated by LS and MLE. (The LS estimates were always found to be smaller
in absolute value, confirming Greene's result given in Section 10.4.2.) In the
second model of Hausman and Wise (1977), earnings are split into wage and
hours as $Y = W \cdot H$, leading to the same equations as those of Heckman (Eqs.
10.8.6 and 10.8.7) except that the conditioning event is

$$
\log W_i + \log H_i < \log L_i \tag{10.8.19}
$$

instead of (10.8.8). Thus this model is a simple extension of Type 3 and
belongs to the same class of models as the first model of Roberts, Maddala, and
Enholm (1978), which we discussed earlier, except for the fact that the model
of Hausman and Wise is truncated. The model of Hausman and Wise (1979)
is also of this type. The model presented in their 1976 article is an extension of
(10.8.18), where earnings observations are split into the preexperiment
(subscript 1) and experiment (subscript 2) periods as

$$
Y_{1i} = Y_{1i}^* \quad \text{and} \quad Y_{2i} = Y_{2i}^* \quad \text{if } Y_{1i}^* < L_i. \tag{10.8.20}
$$

Thus the model is essentially Type 3, except for a minor variation due to the
fact that $L_i$ varies with $i$.
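
Because the sample in (10.8.18) is truncated rather than censored, each observation contributes the normal density divided by the probability of falling below the poverty level. The following sketch of that log likelihood is an illustration under stated assumptions: the function name `truncated_loglik` and its arguments (lists of earnings, of regression means, and of poverty levels, plus a common standard deviation) are hypothetical and do not reproduce Hausman and Wise's estimation code.

```python
import math

def log_phi(z):
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def Phi(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def truncated_loglik(y, xb, L, sigma):
    """Log likelihood of a sample truncated from above as in (10.8.18).

    y: observed earnings; xb: regression means; L: truncation points.
    """
    ll = 0.0
    for yi, mi, Li in zip(y, xb, L):
        if yi >= Li:
            raise ValueError("observation lies outside the truncated sample")
        ll += (log_phi((yi - mi) / sigma) - math.log(sigma)
               - math.log(Phi((Li - mi) / sigma)))
    return ll
```

The division by the probability of qualifying is what distinguishes this likelihood from the least squares criterion: it makes the density of each observation integrate to one over the region below its own truncation point.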


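10.9 Type 4 Tobit Model: $P(y_1 < 0, y_3) \cdot P(y_1, y_2)$

10.9.1 Definition and Estimation


The Type 4 Tobit model is defined as follows:

$$
\begin{aligned}
y_{1i}^* &= x_{1i}'\beta_1 + u_{1i} \\
y_{2i}^* &= x_{2i}'\beta_2 + u_{2i} \\
y_{3i}^* &= x_{3i}'\beta_3 + u_{3i} \\
y_{1i} &= y_{1i}^* \quad \text{if } y_{1i}^* > 0 \\
&= 0 \quad \text{if } y_{1i}^* \le 0 \\
y_{2i} &= y_{2i}^* \quad \text{if } y_{1i}^* > 0 \\
&= 0 \quad \text{if } y_{1i}^* \le 0 \\
y_{3i} &= y_{3i}^* \quad \text{if } y_{1i}^* \le 0 \\
&= 0 \quad \text{if } y_{1i}^* > 0, \qquad i = 1, 2, \ldots, n,
\end{aligned}
\tag{10.9.1}
$$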


where $\{u_{1i}, u_{2i}, u_{3i}\}$ are i.i.d. drawings from a trivariate normal distribution.

This model differs from Type 3 defined by (10.8.1) only by the addition of
$y_{3i}^*$, which is observed only if $y_{1i}^* \le 0$. The estimation of this model is not
significantly different from that of Type 3. The likelihood function can be
written as

$$
L = \prod_0 \int_{-\infty}^0 f_3(y_{1i}^*, y_{3i})\, dy_{1i}^* \prod_1 f_2(y_{1i}, y_{2i}), \tag{10.9.2}
$$

where $f_3(\,\cdot\,, \cdot\,)$ is the joint density of $y_{1i}^*$ and $y_{3i}^*$ and $f_2(\,\cdot\,, \cdot\,)$ is the joint
density of $y_{1i}^*$ and $y_{2i}^*$. Heckman's two-step method for this model is similar to
the method for the preceding model. However, we must deal with three
conditional expectation equations in the present model. The equation for $y_{3i}$
will be slightly different from the other two because the variable is nonzero
when $y_{1i}^*$ is nonpositive. We obtain

$$
E(y_{3i} \mid y_{1i}^* \le 0) = x_{3i}'\beta_3 - \sigma_{13}\sigma_1^{-1}\lambda(-x_{1i}'\beta_1/\sigma_1). \tag{10.9.3}
$$
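
The conditional mean (10.9.3) can be checked against a simple Monte Carlo average. The sketch below is illustrative only: the function names, the scalar parameterization (systematic parts, standard deviations, and the covariance between the first and third errors), and the simulation design are assumptions made for the example, and $\lambda(z)$ is taken to be $\phi(z)/\Phi(z)$.

```python
import math
import random

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def lam(z):
    return phi(z) / Phi(z)

def cond_mean_y3(x3b3, x1b1, sigma1, sigma13):
    """Right-hand side of (10.9.3): E(y3 | y1* <= 0)."""
    return x3b3 - sigma13 / sigma1 * lam(-x1b1 / sigma1)

def simulated_mean_y3(x3b3, x1b1, sigma1, sigma3, sigma13, n=200000, seed=0):
    """Average of y3* over simulated draws with y1* <= 0."""
    rng = random.Random(seed)
    b = sigma13 / sigma1 ** 2                                  # slope of E(u3 | u1)
    s = math.sqrt(sigma3 ** 2 - sigma13 ** 2 / sigma1 ** 2)    # conditional std. dev.
    total, count = 0.0, 0
    for _ in range(n):
        u1 = rng.gauss(0.0, sigma1)
        if x1b1 + u1 <= 0:
            u3 = b * u1 + s * rng.gauss(0.0, 1.0)
            total += x3b3 + u3
            count += 1
    return total / count
```

The minus sign in front of the correction term reflects the conditioning on the lower tail of the first error: when the covariance is positive, observations with nonpositive $y_{1i}^*$ have a below-average third error.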

We shall discuss three examples of the Type 4 Tobit model in the following
subsections: the model of Kenny et al. (1979); the model of Nelson and Olson
(1978); and the model of Tomes (1981). In the first two models the $y^*$ equations
are written as simultaneous equations, like Heckman's model (1974), for
which the reduced-form equations take the form of (10.9.1). Tomes' model
has a slight twist. The estimation of the structural parameters of such models
can be handled in much the same way as the estimation of Heckman's model
(1974), that is, by either Heckman's simultaneous equations two-step method
(and its Lee-Maddala-Trost extension) or by Amemiya's LS and GLS, both of
which were discussed in Section 10.8.

In fact, these two estimation methods can easily accommodate the following
very general simultaneous equations Tobit model:

$$
\Gamma' y_i^* = B' x_i + u_i, \qquad i = 1, 2, \ldots, n, \tag{10.9.4}
$$

where the elements of the vector $y_i^*$ contain the following three classes of
variables: (1) always completely observable, (2) sometimes completely observable
and sometimes observed to lie in intervals, and (3) always observed to lie
in intervals. Note that the variable classified as C in Table 10.3 belongs to class
2 and the variable classified as B belongs to class 3. The models of Heckman
(1974), Kenny et al. (1979), and Nelson and Olson (1978), as well as a few
more models discussed under Type 5, such as that of Heckman (1978), are all
special cases of the model (10.9.4).


10.9.2 Model of Kenny, Lee, Maddala, and Trost 


Kenny et al. (1979) tried to explain earnings differentials between those who
went to college and those who did not. We shall explain their model using the
variables appearing in (10.9.1). In their model, $y_1^*$ refers to the desired years of
college education, $y_2^*$ to the earnings of those who go to college, and $y_3^*$ to the
earnings of those who do not go to college. A small degree of simultaneity is
introduced into the model by letting $y_1^*$ appear in the right-hand side of the $y_2^*$
equation. Kenny and his coauthors used the MLE. They noted that the MLE
iterations did not converge when started from the LS estimates but did converge
very fast when started from Heckman's two-step estimates (simultaneous
equations version).


10.9.3 Model of Nelson and Olson 


The empirical model actually estimated by Nelson and Olson (1978) is more 
general than Type 4 and is a general simultaneous equations Tobit model 
(10.9.4). The Nelson-Olson empirical model involves four elements of the 
vector y*: 


$y_1^*$  Time spent on vocational school training, completely observed if
$y_1^* > 0$, and otherwise observed to lie in the interval $(-\infty, 0]$

$y_2^*$  Time spent on college education, observed to lie in one of the three
intervals $(-\infty, 0]$, $(0, 1]$, and $(1, \infty)$

$y_3^*$  Wage, always completely observed

$y_4^*$  Hours worked, always completely observed

These variables are related to each other by simultaneous equations. However,
they merely estimate each reduced-form equation separately by various
appropriate methods and obtain the estimates of the structural parameters
from the estimates of the reduced-form parameters in an arbitrary way.

The model that Nelson and Olson analyzed theoretically in more detail is 
the two-equation model: 


$$
y_{1i}^* = \gamma_1 y_{2i} + x_{1i}'\alpha_1 + v_{1i} \tag{10.9.5}
$$

and

$$
y_{2i} = \gamma_2 y_{1i}^* + x_{2i}'\alpha_2 + v_{2i}, \tag{10.9.6}
$$

where $y_{2i}$ is always observed and $y_{1i}^*$ is observed to be $y_{1i}$ if $y_{1i}^* > 0$. This model
may be used, for example, if we are interested in explaining only $y_1^*$ and $y_3^*$ in
the Nelson-Olson empirical model. The likelihood function of this model may
be characterized by $P(y_1 < 0, y_2) \cdot P(y_1, y_2)$, and therefore, the model is a
special case of Type 4.

Nelson and Olson proposed estimating the structural parameters of this 
model by the following sequential method: 

Step 1. Estimate the parameters of the reduced-form equation for $y_1^*$ by the
Tobit MLE and those of the reduced-form equation for $y_2$ by LS.

Step 2. Replace $y_{2i}$ in the right-hand side of (10.9.5) by its LS predictor
obtained in step 1 and estimate the parameters of (10.9.5) by the Tobit MLE.

Step 3. Replace $y_{1i}^*$ in the right-hand side of (10.9.6) by its predictor obtained
in step 1 and estimate the parameters of (10.9.6) by LS.

Amemiya (1979) obtained the asymptotic variance-covariance matrix of the
Nelson-Olson estimator and showed that the Amemiya GLS (see Section
10.8.4) based on the same reduced-form estimates is asymptotically more
efficient.


10.9.4 Model of Tomes 


Tomes (1981) studied a simultaneous relationship between inheritance and 
the recipient's income. Although it is not stated explicitly, Tomes' model can 
be defined by 


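$$
y_{1i}^* = \gamma_1 y_{2i} + x_{1i}'\beta_1 + u_{1i}, \tag{10.9.7}
$$

$$
y_{2i} = \gamma_2 y_{1i} + x_{2i}'\beta_2 + u_{2i}, \tag{10.9.8}
$$

and

$$
\begin{aligned}
y_{1i} &= y_{1i}^* \quad \text{if } y_{1i}^* > 0 \\
&= 0 \quad \text{if } y_{1i}^* \le 0,
\end{aligned}
\tag{10.9.9}
$$

where $y_{1i}^*$ is the potential inheritance, $y_{1i}$ is the actual inheritance, and $y_{2i}$ is the
recipient's income. Note that this model differs from the Nelson-Olson model defined
by (10.9.5) and (10.9.6) only in that $y_{1i}$, not $y_{1i}^*$, appears in the right-hand side
of (10.9.8). Assuming $\gamma_1\gamma_2 < 1$ for the logical consistency of the model (as in
Amemiya, 1974b, mentioned in Section 10.6), we can rewrite (10.9.7) as

$$
y_{1i}^* = (1 - \gamma_1\gamma_2)^{-1}\left[\gamma_1(x_{2i}'\beta_2 + u_{2i}) + x_{1i}'\beta_1 + u_{1i}\right] \tag{10.9.10}
$$

and (10.9.8) as

$$
\begin{aligned}
y_{2i} &= y_{2i}^{(1)} \equiv (1 - \gamma_1\gamma_2)^{-1}\left[\gamma_2(x_{1i}'\beta_1 + u_{1i}) + x_{2i}'\beta_2 + u_{2i}\right] \quad \text{if } y_{1i}^* > 0 \\
&= y_{2i}^{(2)} \equiv x_{2i}'\beta_2 + u_{2i} \quad \text{if } y_{1i}^* \le 0.
\end{aligned}
\tag{10.9.11}
$$

Thus the likelihood function of the model is

$$
L = \prod_0 \int_{-\infty}^0 f_3(y_{1i}^*, y_{2i}^{(2)})\, dy_{1i}^* \prod_1 f_2(y_{1i}, y_{2i}^{(1)}), \tag{10.9.12}
$$

which is the same as (10.9.2).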
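
The two regimes in (10.9.10) and (10.9.11) can be traced through for a single observation. The function below is a hypothetical sketch (its name and scalar arguments are assumptions for illustration). In the regime with nonpositive potential inheritance it evaluates (10.9.7) directly with actual inheritance set to zero; under the assumption that the product of the two simultaneity coefficients is less than one, this has the same sign as the right-hand side of (10.9.10), which is all the regime rule (10.9.9) uses.

```python
def tomes_solve(g1, g2, x1b1, x2b2, u1, u2):
    """Solve Tomes' model (10.9.7)-(10.9.9) for one observation.

    g1, g2: simultaneity coefficients; x1b1, x2b2: systematic parts;
    u1, u2: error terms. Returns (y1_star, y1, y2).
    """
    if g1 * g2 >= 1.0:
        raise ValueError("logical consistency requires g1 * g2 < 1")
    t = g1 * (x2b2 + u2) + x1b1 + u1          # bracketed term in (10.9.10)
    if t > 0:
        y1_star = t / (1.0 - g1 * g2)         # (10.9.10)
        y1 = y1_star
        y2 = (g2 * (x1b1 + u1) + x2b2 + u2) / (1.0 - g1 * g2)   # first line of (10.9.11)
    else:
        y1 = 0.0
        y2 = x2b2 + u2                        # second line of (10.9.11)
        y1_star = g1 * y2 + x1b1 + u1         # (10.9.7) evaluated at y1 = 0
    return y1_star, y1, y2
```

In either regime the returned values satisfy the structural equations (10.9.7) and (10.9.8) simultaneously, which is the sense in which the condition on the coefficients guarantees a unique, well-defined outcome.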


10.10 Type 5 Tobit Model: $P(y_1 < 0, y_3) \cdot P(y_1 > 0, y_2)$

10.10.1 Definition and Estimation


The Type 5 Tobit model is obtained from the Type 4 model (10.9.1) by
omitting the equation for $y_{1i}$. We merely observe the sign of $y_{1i}^*$. Thus the
model is defined by

$$
\begin{aligned}
y_{1i}^* &= x_{1i}'\beta_1 + u_{1i} \\
y_{2i}^* &= x_{2i}'\beta_2 + u_{2i} \\
y_{3i}^* &= x_{3i}'\beta_3 + u_{3i} \\
y_{2i} &= y_{2i}^* \quad \text{if } y_{1i}^* > 0 \\
&= 0 \quad \text{if } y_{1i}^* \le 0 \\
y_{3i} &= y_{3i}^* \quad \text{if } y_{1i}^* \le 0 \\
&= 0 \quad \text{if } y_{1i}^* > 0, \qquad i = 1, 2, \ldots, n,
\end{aligned}
\tag{10.10.1}
$$

where $\{u_{1i}, u_{2i}, u_{3i}\}$ are i.i.d. drawings from a trivariate normal distribution.
The likelihood function of the model is

$$
L = \prod_0 \int_{-\infty}^0 f_3(y_{1i}^*, y_{3i})\, dy_{1i}^* \prod_1 \int_0^\infty f_2(y_{1i}^*, y_{2i})\, dy_{1i}^*, \tag{10.10.2}
$$

where $f_3$ and $f_2$ are as defined in (10.9.2). Because this model is somewhat
simpler than Type 4, the estimation methods discussed in the preceding
section apply to this model a fortiori. Hence, we shall go immediately into the
discussion of applications.


10.10.2 Model of Lee 


In the model of Lee (1978), $y_{2i}^*$ represents the logarithm of the wage rate of the
$i$th worker in case he or she joins the union and $y_{3i}^*$ represents the same in case
he or she does not join the union. Whether or not the worker joins the union is
determined by the sign of the variable

$$
y_{1i}^* = y_{2i}^* - y_{3i}^* + z_i'\alpha + v_i. \tag{10.10.3}
$$

Because we observe only $y_{2i}^*$ if the worker joins the union and $y_{3i}^*$ if the worker
does not, the logarithm of the observed wage, denoted $y_i$, is defined by

$$
\begin{aligned}
y_i &= y_{2i}^* \quad \text{if } y_{1i}^* > 0 \\
&= y_{3i}^* \quad \text{if } y_{1i}^* \le 0.
\end{aligned}
\tag{10.10.4}
$$

Lee assumed that $x_2$ and $x_3$ (the independent variables in the $y_2^*$ and $y_3^*$
equations) include the individual characteristics of firms and workers such as
regional location, city size, education, experience, race, sex, and health,
whereas $z$ includes certain other individual characteristics and variables that
represent the monetary and nonmonetary costs of becoming a union member.
Because $y_1^*$ is unobserved except for the sign, the variance of $y_1^*$ can be assumed
to be unity without loss of generality.

Lee estimated his model by Heckman's two-step method applied separately
to the $y_2^*$ and $y_3^*$ equations. In Lee's model simultaneity exists only in the $y_1^*$
equation and hence is ignored in the application of Heckman's two-step
method. Amemiya's LS or GLS, which accounts for the simultaneity, will, of
course, work for this model as well, and the latter will yield more efficient
estimates, although, of course, not as fully efficient as the MLE.


10.10.3 Type 5 Model of Heckman


The Type 5 model of Heckman (1978) is a simultaneous equations model 
consisting of two equations 


$$
y_{1i}^* = \gamma_1 y_{2i} + x_{1i}'\beta_1 + \delta_1 w_i + u_{1i} \tag{10.10.5}
$$

and

$$
y_{2i} = \gamma_2 y_{1i}^* + x_{2i}'\beta_2 + \delta_2 w_i + u_{2i}, \tag{10.10.6}
$$

where we observe $y_{2i}$, $x_{1i}$, $x_{2i}$, and $w_i$ defined by

$$
\begin{aligned}
w_i &= 1 \quad \text{if } y_{1i}^* > 0 \\
&= 0 \quad \text{if } y_{1i}^* \le 0.
\end{aligned}
\tag{10.10.7}
$$

There are no empirical results in the 1978 article, but the same model was
estimated by Heckman (1976b); in this application $y_{2i}$ represents the average
income of black people in the $i$th state, $y_{1i}^*$ the unobservable sentiment toward
blacks in the $i$th state, and $w_i = 1$ if an antidiscrimination law is instituted in
the $i$th state.

When we solve (10.10.5) and (10.10.6) for $y_{1i}^*$, the solution should not
depend upon $w_i$, for that would clearly lead to logical inconsistencies. Therefore
we must assume

$$
\gamma_1\delta_2 + \delta_1 = 0 \tag{10.10.8}
$$

for Heckman's model to be logically consistent. Using the constraint
(10.10.8), we can write the reduced-form equations (although strictly speaking
not reduced-form because of the presence of $w_i$) of the model as

$$
y_{1i}^* = x_i'\pi_1 + v_{1i} \tag{10.10.9}
$$

and

$$
y_{2i} = \delta_2 w_i + x_i'\pi_2 + v_{2i}, \tag{10.10.10}
$$

where we can assume $Vv_{1i} = 1$ without loss of generality. Thus Heckman's
model is a special case of Type 5 with just a constant shift between $y_2^*$ and $y_3^*$
(that is, $y_{2i}^* = x_i'\pi_2 + v_{2i}$ and $y_{3i}^* = \delta_2 + x_i'\pi_2 + v_{2i}$). Moreover, if $\delta_2 = 0$, it is a
special case of Type 5 where $y_2^* = y_3^*$.
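
The role of the constraint (10.10.8) can be seen by substituting (10.10.6) into (10.10.5) and asking which values of the dummy variable agree with the sign of the implied latent variable. The sketch below is illustrative only: the function names are hypothetical, `c1` and `c2` stand for the sums of the systematic part and the error in each structural equation, and the product of the two simultaneity coefficients is assumed to differ from one.

```python
def solved_y1_star(g1, g2, d1, d2, c1, c2, w):
    """y1* obtained by substituting (10.10.6) into (10.10.5).

    g1, g2: simultaneity coefficients; d1, d2: coefficients on w;
    c1, c2: systematic part plus error of each equation.
    """
    return (g1 * (c2 + d2 * w) + c1 + d1 * w) / (1.0 - g1 * g2)

def consistent_regimes(g1, g2, d1, d2, c1, c2):
    """Values of w in {0, 1} that agree with the sign of the implied y1*,
    as the definition (10.10.7) requires."""
    return [w for w in (0, 1)
            if (solved_y1_star(g1, g2, d1, d2, c1, c2, w) > 0) == (w == 1)]
```

When (10.10.8) holds, the solved latent variable is the same for both values of the dummy, so exactly one regime is consistent. When it fails, examples are easy to construct in which no regime or both regimes are consistent, which is the logical inconsistency referred to in the text.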

Let us compare Heckman's reduced-form model defined by (10.10.9) and
(10.10.10) with Lee's model. Equation (10.10.9) is essentially the same
as (10.10.3) of Lee's model. Equation (10.10.4) of Lee's model can be rewritten
as

$$
\begin{aligned}
y_i &= w_i(x_{2i}'\beta_2 + u_{2i}) + (1 - w_i)(x_{3i}'\beta_3 + u_{3i}) \\
&= x_{3i}'\beta_3 + u_{3i} + w_i(x_{2i}'\beta_2 + u_{2i} - x_{3i}'\beta_3 - u_{3i}).
\end{aligned}
\tag{10.10.11}
$$

By comparing (10.10.10) and (10.10.11), we readily see that Heckman's
reduced-form model is a special case of Lee's model in which the coefficient
multiplied by $w_i$ is a constant.

Heckman proposed a sequential method of estimation for the structural 
parameters, which can be regarded as an extension of Heckman's simulta- 
neous equations two-step estimation discussed in Section 10.8.3. His method 
consists of the following steps: 

Step 1. Estimate $\pi_1$ by applying the probit MLE to (10.10.9). Denote the
estimator $\hat{\pi}_1$ and define $\hat{F}_i = F(x_i'\hat{\pi}_1)$.

Step 2. Insert (10.10.9) into (10.10.6), replace $\pi_1$ with $\hat{\pi}_1$ and $w_i$ with $\hat{F}_i$, and
then estimate $\gamma_2$, $\beta_2$, and $\delta_2$ by least squares applied to (10.10.6).

Step 3. Solve (10.10.5) for $y_{2i}$, eliminate $y_{1i}^*$ by (10.10.9), and then apply
least squares to the resulting equation after replacing $\pi_1$ by $\hat{\pi}_1$ and $w_i$ by $\hat{F}_i$ to
estimate $\gamma_1^{-1}$, $\gamma_1^{-1}\beta_1$, and $\gamma_1^{-1}\delta_1$.

Amemiya (1978c) derived the asymptotic variance-covariance matrix of 
Heckman's estimator defined in the preceding paragraph and showed that 
Amemiya's GLS (defined in Section 10.8.4) applied to the model yields an
asymptotically more efficient estimator in the special case of $\delta_1 = \delta_2 = 0$. As
pointed out by Lee (1981), however, Amemiya’s GLS can also be applied to
the model with nonzero $\delta$'s as follows:

Step 1. Estimate $\pi_1$ by the probit MLE $\hat\pi_1$ applied to (10.10.9).

Step 2. Estimate $\delta_2$ and $\pi_2$ by applying the instrumental variables method
to (10.10.10), using $\hat F_i$ as the instrument for $w_i$. Denote these estimators as $\hat\delta_2$
and $\hat\pi_2$.

Step 3. Derive the estimates of the structural parameters $\gamma_1$, $\beta_1$, $\delta_1$, $\gamma_2$, $\beta_2$,
and $\delta_2$ from $\hat\pi_1$, $\hat\pi_2$, and $\hat\delta_2$, using the relationship between the reduced-form
parameters and the structural parameters as well as the constraint (10.10.8) in
the manner described in Section 10.8.4.

The resulting estimator can be shown to be asymptotically more efficient than 
Heckman's estimator. 


10.10.4 Disequilibrium Models 


Disequilibrium models constitute an extensive area of research, about which 
numerous papers have been written. Some of the early econometric models 
have been surveyed by Maddala and Nelson (1974). A more extensive and
up-to-date survey has been given by Quandt (1982). See, also, the article by
Hartley (1976a) for a connection between a disequilibrium model and the 
standard Tobit model. Here we shall mention two basic models first discussed 
in the pioneering work of Fair and Jaffee (1972). 

The simplest disequilibrium model of Fair and Jaffee is a special case of the
Type 5 model (10.10.1), in which $y_{2i}^*$ is the quantity demanded in the ith
period, $y_{3i}^*$ is the quantity supplied in the ith period, and $y_{1i}^* = y_{3i}^* - y_{2i}^*$. Thus
the actual quantity sold, which a researcher observes, is the minimum of
supply and demand. The fact that the variance-covariance matrix of
$(y_{1i}^*, y_{2i}^*, y_{3i}^*)$ is only of rank 2 because of the linear relationship above does not
essentially change the nature of the model because the likelihood function
(10.10.2) involves only bivariate densities.
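
The observation rule of this model can be illustrated by a small simulation. The following is only a sketch: the linear demand and supply schedules, the parameter values, and the function name `simulate_market` are hypothetical choices made for illustration and are not taken from Fair and Jaffee. It shows that the observed quantity is the minimum of the two latent quantities and that the regime (demand observed or supply observed) is determined by the sign of supply minus demand.

```python
import random


def simulate_market(n, seed=0):
    """Draw latent demand and supply and apply the minimum rule.

    The schedules and parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        price = rng.uniform(1.0, 3.0)
        demand = 10.0 - 2.0 * price + rng.gauss(0.0, 1.0)  # latent demand
        supply = 4.0 + 1.0 * price + rng.gauss(0.0, 1.0)   # latent supply
        quantity = min(demand, supply)                     # observed quantity
        demand_observed = supply - demand > 0              # regime indicator
        rows.append((price, demand, supply, quantity, demand_observed))
    return rows


rows = simulate_market(1000)
share = sum(1 for r in rows if r[4]) / len(rows)
print(f"share of periods in which demand is observed: {share:.3f}")
```

In actual data the regime indicator is not directly observed in the simplest model; it is kept here only to make the sample separation explicit.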

In another model Fair and Jaffee added the price equation to the model of 
the preceding paragraphs as 


$y_{4i} = \gamma(y_{2i}^* - y_{3i}^*)$, (10.10.12)


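where $y_{4i}$ denotes a change in the price at the ith period. The likelihood
function of this model can be written as


$L = \prod_0 \int_{-\infty}^{0} f_3(y_{1i}^*, y_{3i} | y_{4i}) f(y_{4i}) \, dy_{1i}^*$ (10.10.13)
$\quad \times \prod_1 \int_0^{\infty} f_2(y_{1i}^*, y_{2i} | y_{4i}) f(y_{4i}) \, dy_{1i}^*$.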


The form of the likelihood function does not change if we add a normal error 
term to the right-hand side of (10.10.12). In either case the model may be 
schematically characterized by 


$P(y_1 < 0, y_3, y_4) \cdot P(y_1 > 0, y_2, y_4)$, (10.10.14)


which is a simple generalization of the Type 5 model. 


10.10.5 Multivariate Generalizations 


By a multivariate generalization of Type 5, we mean a model in which $y_{2i}^*$ and
$y_{3i}^*$ in (10.10.1) are vectors, whereas $y_{1i}^*$ is a scalar variable the sign of which is
observed as before. Therefore the Fair-Jaffee model with likelihood function
characterized by (10.10.14) is an example of this type of model. 

In Lee's model (1977) the $y_{2i}^*$ equation is split into two equations


$C_{2i}^* = x_{2i}'\beta_2 + u_{2i}$ (10.10.15)
and
$T_{2i}^* = z_{2i}'\alpha_2 + v_{2i}$, (10.10.16)


where $C_{2i}^*$ and $T_{2i}^*$ denote the cost and the time incurred by the ith person
traveling by a private mode of transportation, and, similarly, the cost and the
time of traveling by a public mode are specified as


$C_{3i}^* = x_{3i}'\beta_3 + u_{3i}$ (10.10.17)
and
$T_{3i}^* = z_{3i}'\alpha_3 + v_{3i}$. (10.10.18)


Lee assumed that $C_{2i}^*$ and $T_{2i}^*$ are observed if the ith person uses a private mode
and $C_{3i}^*$ and $T_{3i}^*$ are observed if he or she uses a public mode. A private mode is
used if $y_{1i}^* > 0$, where $y_{1i}^*$ is given by


$y_{1i}^* = s_i'\delta_1 + \delta_2 T_{2i}^* + \delta_3 T_{3i}^* + \delta_4(C_{2i}^* - C_{3i}^*) + \epsilon_i$. (10.10.19)


Lee estimated his model by the following sequential procedure: 

Step 1. Apply the probit MLE to (10.10.19) after replacing the starred 
variables with their respective right-hand sides. 

Step 2. Apply LS to each of the four equations (10.10.15) through (10.10.18) 
after adding to the right-hand side of each the estimated hazard from step 1. 

Step 3. Predict the dependent variables of the four equations (10.10.15)
through (10.10.18), using the estimates obtained in step 2; insert the predictors
into (10.10.19) and apply the probit MLE again.

Step 4. Calculate the MLE by iteration, starting from the estimates obtained
at the end of step 3.

Willis and Rosen (1979) studied earnings differentials between those who 
went to college and those who did not, using a more elaborate model than that 
of Kenny et al. (1979), which was discussed in Section 10.9.2. In the model of 
Kenny et al., $y_{1i}^*$ (the desired years of college education, the sign of which
determines whether an individual attends college) is specified not to depend
directly on $y_{2i}^*$ and $y_{3i}^*$ (the earnings of the college-goer and the
non-college-goer, respectively). The first inclination of a researcher might be to
hypothesize $y_{1i}^* = y_{2i}^* - y_{3i}^*$. However, this would be an oversimplification because the
decision to go to college should depend on the difference in expected lifetime 
earnings rather than in current earnings. 

Willis and Rosen solved this problem by developing a theory of the maxi- 
mization of discounted, expected lifetime earnings, which led to the following 
model:

$I_{2i}^* = x_{2i}'\beta_2 + u_{2i}$, (10.10.20)

$G_{2i}^* = z_{2i}'\alpha_2 + v_{2i}$, (10.10.21)

$I_{3i}^* = x_{3i}'\beta_3 + u_{3i}$, (10.10.22)

$G_{3i}^* = z_{3i}'\alpha_3 + v_{3i}$, (10.10.23)
and

$R_i = s_i'\gamma + \epsilon_i$, $i = 1, 2, \ldots, n$, (10.10.24)


where $I_{2i}^*$ and $G_{2i}^*$ denote the initial earnings (in logarithm) and the growth rate
of earnings for the college-goer, $I_{3i}^*$ and $G_{3i}^*$ denote the same for the
non-college-goer, and $R_i$ denotes the discount rate. It is assumed that the ith person
goes to college if $y_{1i}^* > 0$ where


and that the variables with subscript 2 are observed if $y_{1i}^* > 0$, those with
subscript 3 are observed if $y_{1i}^* \le 0$, and $R_i$ is never observed. Thus the model is
formally identical to Lee's model (1977). Willis and Rosen used an estimation 
method identical to that of Lee, given earlier in this subsection. 

Borjas and Rosen (1980) used the same model as Willis and Rosen to study 
the earnings differential between those who changed jobs and those who did 
not within a certain period of observation. 


10.10.6 Multinomial Generalizations 


In all the models we have considered so far in Section 10.10, the sign of $y_{1i}^*$
determined two basic categories of observations, such as union members 
versus nonunion members, states with an antidiscrimination law versus those 
without, or college-goers versus non-college-goers. By a multinomial general- 
ization of Type 5, we mean a model in which observations are classified into 
more than two categories. We shall devote most of this subsection to a discus- 
sion of the article by Duncan (1980). 

Duncan presented a model of joint determination of the location of a firm 
and its input-output vectors. A firm chooses the location for which profits are 
maximized, and only the input-output vector for the chosen location is 
observed. Let $s_i(k)$ be the profit of the ith firm when it chooses location k,
$i = 1, 2, \ldots, n$ and $k = 1, 2, \ldots, K$, and let $y_i(k)$ be the input-output
vector for the ith firm at the kth location. To simplify the analysis, we shall
subsequently assume $y_i(k)$ is a scalar, for a generalization to the vector case is
straightforward. It is assumed that

$s_i(k) = x_{ik}^{(1)\prime}\beta + u_{ik}$ (10.10.26)
and

$y_i(k) = x_{ik}^{(2)\prime}\beta + v_{ik}$, (10.10.27)


where $x_{ik}^{(1)}$ and $x_{ik}^{(2)}$ are vector functions of the input-output prices and
economic theory dictates that the same $\beta$ appears in both equations. It is
assumed that $(u_{i1}, u_{i2}, \ldots, u_{iK}, v_{i1}, v_{i2}, \ldots, v_{iK})$ are i.i.d. drawings from a
2K-variate normal distribution. Suppose $s_i(k_i) > s_i(j)$ for any $j \ne k_i$. Then a
researcher observes $y_i(k_i)$ but does not observe $y_i(j)$ for $j \ne k_i$.
For the following discussion, it is useful to define K binary variables for each
i by
$w_i(k) = 1$ if ith firm chooses kth location (10.10.28)
$\quad\quad = 0$ otherwise


and define the vector $w_i = [w_i(1), w_i(2), \ldots, w_i(K)]'$. Also define
$P_{ik} = P[w_i(k) = 1]$ and the vector $P_i = (P_{i1}, P_{i2}, \ldots, P_{iK})'$.

There are many ways to write the likelihood function of the model, but 
perhaps the most illuminating way is to write it as 


$L = \prod_i f[y_i(k_i) | w_i(k_i) = 1] \, P[w_i(k_i) = 1]$, (10.10.29)


where k; is the actual location the ith firm was observed to choose. 

The estimation method proposed by Duncan can be outlined as follows: 

Step 1. Estimate the $\beta$ that characterize f in (10.10.29) by nonlinear WLS.

Step 2. Estimate the $\beta$ that characterize P in (10.10.29) by the multinomial
probit MLE using the nonlinear WLS iteration.

Step 3. Choose the optimum linear combination of the two estimates of $\beta$
obtained in steps 1 and 2.

To describe step 1 explicitly, we must evaluate $\mu_i = E[y_i(k_i) | w_i(k_i) = 1]$
and $\sigma_i^2 = V[y_i(k_i) | w_i(k_i) = 1]$ as functions of $\beta$ and the variances and covariances
of the error terms of Eqs. (10.10.26) and (10.10.27). These conditional
moments can be obtained as follows. Define $z_i(j) = s_i(k_i) - s_i(j)$ and the
(K - 1)-vector $z_i = [z_i(1), \ldots, z_i(k_i - 1), z_i(k_i + 1), \ldots, z_i(K)]'$. To
simplify the notation, write $z_i$ as z, omitting the subscript. Similarly, write
$y_i(k_i)$ as y. Also, define $R = E(y - Ey)(z - Ez)'[E(z - Ez)(z - Ez)']^{-1}$ and
$Q = Vy - RE(z - Ez)(y - Ey)$. Then we obtain


$\mu_i = E(y | z > 0) = Ey + RE(z | z > 0) - REz$ (10.10.30)
and
$\sigma_i^2 = V(y | z > 0) = RV(z | z > 0)R' + Q$. (10.10.31)


The conditional moments of z appearing in (10.10.30) and (10.10.31) can be 
found in the articles by Amemiya (1974b, p. 1002) and Duncan (1980, p. 850). 
Finally, we can describe the nonlinear WLS iteration of step 1 as follows:
Estimate $\sigma_i^2$ by inserting the initial estimates (for example, those obtained by
minimizing $\sum_i [y_i(k_i) - \mu_i]^2$) of the parameters into the right-hand side of
(10.10.31); call it $\hat\sigma_i^2$. Minimize


$\sum_i \hat\sigma_i^{-2} [y_i(k_i) - \mu_i]^2$ (10.10.32)


with respect to the parameters that appear in the right-hand side of (10.10.30).
Use these estimates to evaluate the right-hand side of (10.10.31) again to get
another estimate of $\sigma_i^2$. Repeat the process to yield new estimates of $\beta$.
Now consider step 2. Define


$\Sigma_i = E(w_i - P_i)(w_i - P_i)' = D_i - P_iP_i'$, (10.10.33)


where $D_i$ is the K × K diagonal matrix the kth diagonal element of which is
$P_{ik}$. To perform the nonlinear WLS iteration, first, estimate $\Sigma_i$ by inserting the
initial estimates of the parameters into the right-hand side of (10.10.33)
(denote the estimate thus obtained as $\hat\Sigma_i$); second, minimize


$\sum_i (w_i - P_i)' \hat\Sigma_i^- (w_i - P_i)$, (10.10.34)


where the minus sign in the superscript denotes a generalized inverse, with
respect to the parameters that characterize $P_i$, and repeat the process until the
estimates converge. A generalized inverse $A^-$ of A is any matrix that satisfies
$AA^-A = A$ (Rao, 1973, p. 24). A generalized inverse $\Sigma_i^-$ is obtained from the
matrix $D_i^{-1} + P_{ik}^{-1} ll'$, where l is a vector of ones, by replacing its kth column
and row by a zero vector. It is not unique because we may choose any k.
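
The defining property $AA^-A = A$ can be checked numerically for the multinomial covariance matrix. The sketch below assumes that the generalized inverse is built from $D^{-1} + P_k^{-1} ll'$ with the kth row and column set to zero, and uses an arbitrary probability vector; the helper names `matmul`, `multinomial_cov`, and `g_inverse` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][h] * b[h][j] for h in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def multinomial_cov(p):
    """Sigma = D - p p' for a probability vector p."""
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]


def g_inverse(p, k):
    """D^{-1} + (1 / p[k]) l l' with row k and column k replaced by zeros."""
    n = len(p)
    g = [[(1.0 / p[i] if i == j else 0.0) + 1.0 / p[k] for j in range(n)]
         for i in range(n)]
    for i in range(n):
        g[i][k] = 0.0
        g[k][i] = 0.0
    return g


p = [0.2, 0.3, 0.5]            # illustrative choice probabilities
sigma = multinomial_cov(p)
for k in range(len(p)):
    back = matmul(matmul(sigma, g_inverse(p, k)), sigma)
    max_err = max(abs(back[i][j] - sigma[i][j])
                  for i in range(len(p)) for j in range(len(p)))
    print(k, max_err)          # the error should be negligible for every k
```

Every choice of k gives a different matrix, but each one reproduces $\Sigma$ when multiplied on both sides, which is the sense in which the generalized inverse is not unique.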

Finally, regarding step 3, if we denote the two estimates of $\beta$ obtained
by steps 1 and 2 by $\hat\beta_1$ and $\hat\beta_2$, respectively, and their respective
asymptotic variance-covariance matrices by $V_1$ and $V_2$, the optimal linear
combination of the two estimates is given by $(V_1^{-1} + V_2^{-1})^{-1}V_1^{-1}\hat\beta_1 +
(V_1^{-1} + V_2^{-1})^{-1}V_2^{-1}\hat\beta_2$. This final estimator is asymptotically not fully
efficient, however. To see this, suppose the regression coefficients of (10.10.26)
and (10.10.27) differ: Call them $\beta_1$ and $\beta_2$, say. Then, by a result of Amemiya
(1976b), we know that $\hat\beta_2$ is an asymptotically efficient estimator of $\beta_1$.
However, as we have indicated in Section 10.4.4, $\hat\beta_1$ is not asymptotically efficient.
So a weighted average of the two could not be asymptotically efficient.
Dubin and McFadden (1984) used a model similar to that of Duncan in
their study of the joint determination of the choice of electric appliances and
the consumption of electricity. In their model, $s_i(k)$ may be interpreted as the
utility of the ith family when they use the kth portfolio of appliances, and $y_i(k)$
as the consumption of electricity for the ith person holding the kth portfolio.
The estimation method is essentially similar to Duncan's method. The main 
difference is that Dubin and McFadden assumed that the error terms of 
(10.10.26) and (10.10.27) are distributed with Type I extreme-value distribu- 
tion and hence that the P part of (10.10.29) is multinomial logit. 


Exercises 


1. (Section 10.4.3)
Verify (10.4.19).


2. (Section 10.4.3)
Verify (10.4.28).


3. (Section 10.4.3)


Consider Vy, and V, given in (10.4.32) and (10.4.33). As stated in 
the text, the difference of the two matrices is neither positive definite 
nor negative definite. Show that the first part of Vfw, namely, 
o? (Z' X^'Zy'!, is smaller than V, in the matrix sense. 


4. (Section 10.4.5)
In the standard Tobit model (10.2.3), assume that $\sigma^2 = 1$, $\beta$ is a scalar and
the only unknown parameter, and $\{x_i\}$ are i.i.d. binary random variables
taking 1 with probability p and 0 with probability 1 - p. Derive the
formulae of $p \cdot AV[\sqrt{n}(\hat\beta - \beta)]$ for $\hat\beta$ = Probit MLE, Tobit MLE,
Heckman’s LS, and NLLS. Evaluate them for $\beta$ = 0, 1, and 2.


5. (Section 10.4.6)
Consider the following model:

$y_i = 1$ if $y_i^* \ge 0$
$\quad = 0$ if $y_i^* < 0$, $i = 1, 2, \ldots, n$,


where $\{y_i^*\}$ are independent $N(x_i'\beta, 1)$. It is assumed that $\{y_i\}$ are observed
but $\{y_i^*\}$ are not. Write a step-by-step instruction of the EM algorithm to
obtain the MLE of $\beta$ and show that the MLE is an equilibrium solution of
the iteration.


6. (Section 10.6)
Consider the following model: 


Yu yh if y420 
- if y$s0 
Yo = 1 if y420 
= if yS0, i=1,2,...,2, 


where (1,;, ux) are i.i.d. with the continuous density f( - , + ). Denote the 
marginal density of u; by f, ( * ) and that of uz by ( * ). 

a. Assuming that yı, Yñ, Xu and x, are observed for i=l, 
2,...,n,express the likelihood function in terms of f, fı, and f. 

b. Assuming that y,;, Yz, Xii, and xy are observed for all i, express the 
likelihood function in terms of f, f, and fz. 


7. (Section 10.6) 
Consider the following model: 


$y_i^* = \alpha z_i + u_i$

$z_i^* = \beta y_i + v_i$

$y_i = 1$ if $y_i^* \ge 0$
$\quad = 0$ if $y_i^* < 0$

$z_i = 1$ if $z_i^* \ge 0$
$\quad = 0$ if $z_i^* < 0$,


where $u_i$ and $v_i$ are jointly normal with zero means and nonzero covariance.
Assume that y*, z*, u, and v are unobservable and y and z are
observable. Show that the model makes sense (that is, y and z are uniquely
determined as functions of u and v) if and only if $\alpha\beta = 0$.


8. (Section 10.6) 
In the model of Exercise 7, assume that $\beta = 0$ and that we have n i.i.d.
observations on $(y_i, z_i)$, $i = 1, 2, \ldots, n$. Write the likelihood function
of $\alpha$. You may write the joint density of (u, v) as simply f(u, v) without
explicitly writing the bivariate normal density.


9. (Section 10.6)
Suppose $y_i^*$ and $z_i^*$, $i = 1, 2, \ldots, n$, are i.i.d. and jointly normally
distributed with nonzero correlation. For each i, we observe (1) only $y_i^*$, (2)
only $z_i^*$, or (3) neither, according to the following scheme:

(1) Observe $y_i^* = y_i$ and do not observe $z_i^*$ if $y_i^* \ge z_i^* \ge 0$.

(2) Observe $z_i^* = z_i$ and do not observe $y_i^*$ if $z_i^* > y_i^* \ge 0$.

(3) Do not observe either if $y_i^* < 0$ or $z_i^* < 0$.

Write down the likelihood function of the model. You may write the joint
normal density simply as $f(\cdot, \cdot)$.


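10. (Section 10.7.1)
Write the likelihood function of the following two models (cf. Cragg,
1971).

a. $(y_{1i}^*, y_{2i}^*) \sim$ Bivariate $N(x_{1i}'\beta_1, x_{2i}'\beta_2, 1, \sigma_2^2, \sigma_{12})$

$y_{2i} = y_{2i}^*$ if $y_{1i}^* > 0$ and $y_{2i}^* > 0$
$\quad = 0$ otherwise.

We observe only $y_{2i}$.

b. $(y_{1i}^*, y_{2i}^*) \sim$ Bivariate $N(x_{1i}'\beta_1, x_{2i}'\beta_2, 1, \sigma_2^2, \sigma_{12})$ with $y_{2i}^*$ truncated so
that $y_{2i}^* > 0$

$y_{2i} = y_{2i}^*$ if $y_{1i}^* > 0$
$\quad = 0$ if $y_{1i}^* \le 0$.

We observe only $y_{2i}$.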


11. (Section 10.9.4)
In Tomes’ model defined by (10.9.7) through (10.9.9), consider the
following estimation method: Step 1. Regress $y_{2i}$ on $x_{1i}$ and $x_{2i}$ and obtain the
least squares predictor $\hat y_{2i}$. Step 2. Substitute $\hat y_{2i}$ for $y_{2i}$ in (10.9.7) and
apply the Tobit MLE to Eqs. (10.9.7) and (10.9.9). Will this method yield
consistent estimates of $\gamma_1$ and $\beta_1$?


12. (Section 10.10)
Suppose the joint distribution of a binary variable w and a continuous
variable y is determined by $P(w = 1 | y) = \Lambda(\gamma_1 y)$ and $f(y | w) =
N(\gamma_2 w, \sigma^2)$. Show that we must assume $\sigma^2\gamma_1 = \gamma_2$ for logical consistency.


13. (Section 10.10.1)
In model (10.10.1), Type 5 Tobit, define an observed variable $y_i$ by

$y_i = y_{2i}^*$ if $y_{1i}^* > 0$
$\quad = y_{3i}^*$ if $y_{1i}^* \le 0$

and assume that a researcher does not observe whether $y_{1i}^* > 0$ or $\le 0$;
that is, the sample separation is unknown. Write the likelihood function
of this model.


14. (Section 10.10.4)

Let (yù, y$, y$) be a three-dimensional vector of continuous random 
variables that are independent across i= 1, 2,. . . , n but may be corre- 
lated among themselves for each i. These random variables are unob- 
served; instead, we observe z, and y, defined as follows 


Z= ybi if yf,>0 
= y if yf S0. 


.10 with probability 4 
Yi 1 with probability 1 — A 


= 0 if y s0. 
Write down the likelihood function. Use the following symbols: 


ify&»0 


Jat yA)  jeintdensity of yf, and y3, 
Sfat y) joint density of yf, and yf. 


15. (Section 10.10.4)
Consider a regression model: $y_{1i}^* = x_{1i}'\beta_1 + u_{1i}$ and $y_{2i}^* = x_{2i}'\beta_2 + u_{2i}$,
where the observable random variable $y_i$ is defined by $y_i = y_{1i}^*$ with
probability $\lambda$ and $y_i = y_{2i}^*$ with probability $1 - \lambda$. This is called a switching
regression model. Write down the likelihood function of the model,
assuming that $(u_{1i}, u_{2i})$ are i.i.d. with joint density $f(\cdot, \cdot)$.


16. (Section 10.10.6)
Show $\Sigma_i\Sigma_i^-\Sigma_i = \Sigma_i$, where $\Sigma_i$ is given in (10.10.33) and $\Sigma_i^-$ is given after
(10.10.34). Let $w_i^*$ and $P_i^*$ be the vectors obtained by eliminating the kth
element from $w_i$ and $P_i$, where k can be arbitrary, and let $\Sigma_i^*$ be the
variance-covariance matrix of $w_i^*$. Then show $(w_i - P_i)'\Sigma_i^-(w_i - P_i) =
(w_i^* - P_i^*)'(\Sigma_i^*)^{-1}(w_i^* - P_i^*)$.


11 Markov Chain and Duration Models 


We can use the term time series models in a broad sense to mean statistical 
models that specify how the distribution of random variables observed over 
time depends on their past observations. Thus defined, Markov chain models 
and duration models, as well as the models discussed in Chapter 5, are special 
cases of time series models. However, time series models in a narrow sense 
refer to the models of Chapter 5, in which random variables take on continu- 
ous values and are observed at discrete times. Thus we may characterize the 
models of Chapter 5 as continuous-state, discrete-time models. Continuous- 
state, continuous-time models also constitute an important class of models, 
although we have not discussed them. In contrast, Markov chain models (or, 
more simply, Markov models) may be characterized as discrete-state, dis- 
crete-time models, and duration models (or, survival models) as discrete-state, 
continuous-time models. In this chapter we shall take up these two models in 
turn.

The reader who wishes to pursue either of these topics in greater detail than
is presented here should consult the textbooks by Bartholomew (1982) for 
Markov models and by Kalbfleisch and Prentice (1980) or Miller (1981) for 
duration models. For recent results on duration models with econometric 
applications, see Heckman and Singer (1984b). 


11.1 Markov Chain Models 
11.1.1 Basic Theory 


Define a sequence of binary random variables 


$y_j^i(t) = 1$ if ith person is in state j at time t (11.1.1)
$\quad\quad = 0$ otherwise,
$i = 1, 2, \ldots, N$, $t = 1, 2, \ldots, T$,
$j = 1, 2, \ldots, M$.

Markov chain models specify the probability distribution of $y_j^i(t)$ as a
function of $y_k^i(s)$, $k = 1, 2, \ldots, M$ and $s = t - 1, t - 2, \ldots$, as well as
(possibly) of exogenous variables.

Markov chain models can be regarded as generalizations of qualitative 
response models. As noted in Section 9.7, Markov chain models reduce to QR 
models if $y_j^i(t)$ are independent over t. In fact, we have already discussed one
type of Markov models in Section 9.7.2.

Models where the distribution of $y_j^i(t)$ depends on $y_k^i(t - 1)$ but not on
$y_k^i(t - 2)$, $y_k^i(t - 3)$, . . . are called first-order Markov models. We shall
marily discuss such models, although higher-order Markov models will also 
be discussed briefly. 

First-order Markov models are completely characterized if we specify the 
transition probabilities defined by 


$P_{jk}^i(t)$ = Prob[ith person is in state k at time t given that (11.1.2)
he was in state j at time t - 1]


and the distribution of $y_j^i(0)$, the initial conditions.
The following symbols will be needed for our discussion: 


$y^i(t)$ = M-vector the jth element of which is $y_j^i(t)$ (11.1.3)
$n_j(t) = \sum_{i=1}^{N} y_j^i(t)$
$n_{jk}(t) = \sum_{i=1}^{N} y_j^i(t - 1) y_k^i(t)$
$n_{jk}^i = \sum_{t=1}^{T} y_j^i(t - 1) y_k^i(t)$
$n_{jk} = \sum_{t=1}^{T} n_{jk}(t) = \sum_{i=1}^{N} n_{jk}^i$
$P^i(t) = \{P_{jk}^i(t)\}$, an M × M matrix
$p_j^i(t)$ = Prob[ith person is in state j at time t]
$p^i(t)$ = M-vector the jth element of which is $p_j^i(t)$.


The matrix $P^i(t)$ is called a Markov matrix. It has the following properties:
(1) Every element of $P^i(t)$ is nonnegative. (2) The sum of each row is unity. (In
other words, if l is an M-vector of ones, then $P^i(t)l = l$.)

If $y_j^i(0)$ is a binary random variable taking the value of 1 with probability
$p_j^i(0)$, the likelihood function of the first-order Markov model can be written
as


$L = \prod_t \prod_i \prod_k \prod_j P_{jk}^i(t)^{y_j^i(t-1) y_k^i(t)} \cdot \prod_i \prod_j p_j^i(0)^{y_j^i(0)}$, (11.1.4)


where t ranges from 1 to T, i ranges from 1 to N, and k and j range from 1 to M
unless otherwise noted. If the initial values $y_j^i(0)$ are assumed to be known
constants, the last product term of the right-hand side of (11.1.4) should be
dropped.
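
Because exactly one indicator $y_j^i(t)$ equals 1 at each date, the logarithm of (11.1.4) is simply a sum of log transition probabilities along each person's observed path. The following sketch evaluates it under the assumption that states are coded 0, 1, . . . , M - 1 and that each person's history is stored as a list of states at times 0 through T; the function name `log_likelihood` and the callables `trans` and `init` are hypothetical devices for supplying $P^i(t)$ and $p^i(0)$.

```python
import math


def log_likelihood(sequences, trans, init=None):
    """Log of (11.1.4).

    sequences[i][t] is the state of person i at time t (t = 0, ..., T).
    trans(i, t) returns the M x M matrix of transition probabilities.
    init(i) returns the vector of initial probabilities; if it is None,
    the initial states are treated as known constants.
    """
    total = 0.0
    for i, seq in enumerate(sequences):
        if init is not None:
            total += math.log(init(i)[seq[0]])
        for t in range(1, len(seq)):
            total += math.log(trans(i, t)[seq[t - 1]][seq[t]])
    return total


P = [[0.9, 0.1], [0.4, 0.6]]                 # illustrative Markov matrix
seqs = [[0, 0, 1, 1], [1, 0, 0, 0]]
ll_cond = log_likelihood(seqs, lambda i, t: P)
ll_full = log_likelihood(seqs, lambda i, t: P, init=lambda i: [0.8, 0.2])
print(ll_cond, ll_full)
```

Passing a function for the transition matrix leaves room for heterogeneous or nonstationary specifications; the example uses the same matrix for every person and date only for simplicity.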

Clearly, all the parameters $P_{jk}^i(t)$ and $p_j^i(0)$ cannot be estimated
consistently. Therefore they are generally specified as functions of a parameter vector
$\theta$, where the number of elements in $\theta$ is either fixed or goes to $\infty$ at a sufficiently
slow rate (compared to the sample size). Later we shall discuss various ways to
parameterize the transition probabilities $P_{jk}^i(t)$.

A Markov model where $P_{jk}^i(t) = P_{jk}^i$ for all t is called the stationary Markov
model. If $P_{jk}^i(t) = P_{jk}(t)$ for all i, the model is called homogeneous. (Its
antonym is heterogeneous.) A parameterization similar to the one used in QR
models is to specify $P_{jk}^i(t) = F_{jk}[x^i(t)'\beta]$ for some functions $F_{jk}$ such that
$\sum_{k=1}^{M} F_{jk} = 1$. Examples will be given in Sections 11.1.3 and 11.1.4. The case
where $y_j^i(t)$ are independent over t (the case of pure QR models) is a special
case of the first-order Markov model obtained by setting $P_{jk}^i(t) = P_{j'k}^i(t)$ for all
j and j' for each i and t.

For QR models we defined the nonlinear regression models: (9.2.26) for the 
binary case and (9.3.16) for the multinomial case. A similar representation for 
the first-order Markov model can be written as 


$E[y^i(t) | y^i(t - 1), y^i(t - 2), \ldots] = P^i(t)'y^i(t - 1)$ (11.1.5)
or
$y^i(t) = P^i(t)'y^i(t - 1) + u^i(t)$. (11.1.6)


Because these M equations are linearly dependent (their sum is 1), we
eliminate the Mth equation and write the remaining M - 1 equations as


$\bar y^i(t) = \bar P^i(t)'y^i(t - 1) + \bar u^i(t)$. (11.1.7)


Conditional on $y^i(t - 1)$, $y^i(t - 2)$, . . . , we have $E\bar u^i(t) = 0$ and $V\bar u^i(t) =
D(\mu) - \mu\mu'$, where $\mu = \bar P^i(t)'y^i(t - 1)$ and $D(\mu)$ is the diagonal matrix with the
elements of $\mu$ in the diagonal. Strictly speaking, the analog of (9.3.16) is
(11.1.7) rather than (11.1.6) because in (9.3.16) a redundant equation has
been eliminated.
As in QR models, the NLGLS estimator of the parameters that characterize
$P^i(t)$, derived from (11.1.7), yields asymptotically efficient estimates. The
presence of $y^i(t - 1)$ in the right-hand side of the equation does not cause any
problem asymptotically. This is analogous to the fact that the properties of the
least squares estimator in the classical regression model (Model 1) hold
asymptotically for the autoregressive models discussed in Chapter 5. We shall
discuss the NLWLS estimator in greater detail in Section 11.1.3. There we
shall consider a two-state Markov model in which $P^i(t)$ depends on exogenous
variables in a specific way. 

As in QR models, minimum chi-square estimation is possible for Markov 
models in certain cases. We shall discuss these cases in Section 11.1.3. 

Taking the expectation of both sides of (11.1.5) yields 


$p^i(t) = P^i(t)'p^i(t - 1)$. (11.1.8)


It is instructive to rewrite the likelihood function (11.1.4) as a function of
$P_{jk}^i(t)$ and $p^i(t)$ as follows. Because


$\prod_k \prod_j P_{jk}^i(t)^{y_j^i(t-1) y_k^i(t)} = \prod_k \prod_j \left[ \frac{P_{jk}^i(t)}{p_k^i(t)} \right]^{y_j^i(t-1) y_k^i(t)} \prod_k p_k^i(t)^{y_k^i(t)}$, (11.1.9)


we can write (11.1.4) alternatively as


$L = \prod_t \prod_i \prod_k \prod_j \left[ \frac{P_{jk}^i(t)}{p_k^i(t)} \right]^{y_j^i(t-1) y_k^i(t)} \cdot \prod_{t=0}^{T} \prod_i \prod_j p_j^i(t)^{y_j^i(t)}$ (11.1.10)
$\quad \equiv L_1 \cdot L_2$.


If we specify $P_{jk}^i(t)$ and $p_j^i(0)$ to be functions of a parameter vector $\theta$, then by
(11.1.8) $L_2$ is also a function of $\theta$. The partial likelihood function $L_2(\theta)$ has the
same form as the likelihood function of a QR model, and maximizing it will
yield consistent but generally asymptotically inefficient estimates of $\theta$.

As we noted earlier, if $y_j^i(t)$ are independent over t, the rows of the matrix
$P^i(t)$ are identical. Then, using (11.1.8), we readily see $P_{jk}^i(t) = p_k^i(t)$.
Therefore $L = L_2$, implying that the likelihood function of a Markov model is
reduced to the likelihood function of a QR model.

Because $p^i(t)$ is generally a complicated function of the transition
probabilities, maximizing $L_2$ cannot be recommended as a practical estimation
method. However, there is an exception (aside from the independence case
mentioned earlier), that is, the case when $p_j^i(0)$ are equilibrium probabilities.

The notion of equilibrium probability is an important concept for
stationary Markov models. Consider a typical individual and therefore drop the
superscript i from Eq. (11.1.8). Under the stationarity assumption we have


$p(t) = P'p(t - 1)$. (11.1.11)
By repeated substitution we obtain from (11.1.11)

$p(t) = (P')^t p(0)$. (11.1.12)
If $\lim_{t \to \infty} (P')^t$ exists, then

$p(\infty) = (P')^\infty p(0)$. (11.1.13)


We call the elements of $p(\infty)$ equilibrium probabilities. They exist if every
element of P is positive.

It is easy to prove (Bellman, 1970, p. 269) that if every element of P is
positive, the largest (in absolute value or modulus) characteristic root of P is
unity and is unique. Therefore, by Theorem 1 of Appendix 1, there exist a
matrix H and a Jordan canonical form D such that $P' = HDH^{-1}$. Therefore
we obtain


$(P')^\infty = HD^\infty H^{-1} = HJH^{-1}$, (11.1.14)


where J is the M × M matrix consisting of 1 in the northwestern corner and 0
elsewhere. Equilibrium probabilities, if they exist, must satisfy


$p(\infty) = P'p(\infty)$, (11.1.15)


which implies that the first column of H is $p(\infty)$ and hence the first row of $H^{-1}$
is the transpose of the M-vector of unity denoted l. Therefore, from (11.1.14),


$(P')^\infty = p(\infty)l'$. (11.1.16)


Inserting (11.1.16) into (11.1.13) yields the identity $p(\infty) = p(\infty)l'p(0) = p(\infty)$
for any value of p(0). If $p(\infty)$ exists, it can be determined by solving (11.1.15)
subject to the constraint $l'p(\infty) = 1$. Because the rank of $I - P'$ is M - 1 under
the assumption, the $p(\infty)$ thus determined is unique.
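
The independence of $p(\infty)$ from the initial distribution can be seen numerically by iterating $p(t) = P'p(t - 1)$ from different starting vectors. The sketch below uses an arbitrary two-state Markov matrix with positive elements; the function names `step` and `equilibrium`, the tolerance, and the iteration limit are hypothetical choices, and repeated multiplication is used here only as a simple substitute for solving (11.1.15) directly.

```python
def step(P, p):
    """One application of p(t) = P' p(t - 1)."""
    m = len(P)
    return [sum(P[j][k] * p[j] for j in range(m)) for k in range(m)]


def equilibrium(P, p0, tol=1e-12, max_iter=10000):
    """Iterate from p0 until successive vectors differ by less than tol."""
    p = list(p0)
    for _ in range(max_iter):
        nxt = step(P, p)
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    raise RuntimeError("iteration did not converge")


P = [[0.9, 0.1], [0.4, 0.6]]          # every element positive
pa = equilibrium(P, [1.0, 0.0])       # everyone starts in the first state
pb = equilibrium(P, [0.0, 1.0])       # everyone starts in the second state
print(pa, pb)                         # both are close to [0.8, 0.2]
```

For this matrix the equation $p = P'p$ with $l'p = 1$ gives $p = (0.8, 0.2)'$ by hand, which agrees with both limits.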

If $p_j^i(0) = p_j^i(\infty)$, $L_2$ reduces to


$L_2^* = \prod_i \prod_j p_j^i(\infty)^{\sum_{t=0}^{T} y_j^i(t)}$, (11.1.17)


which is the likelihood function of a standard multinomial QR model. Even if
$p_j^i(0) \ne p_j^i(\infty)$, maximizing $L_2^*$ yields a consistent estimate as T goes to infinity.
We shall show this in a simple case in Section 11.1.3.

Now consider the simplest case of homogeneous and stationary Markov 
models characterized by
$P_{jk}^i(t) = P_{jk}$ for all i and t. (11.1.18)
The likelihood function (11.1.4) conditional on $y_j^i(0)$ is reduced to

$L = \prod_j \prod_k P_{jk}^{n_{jk}}$. (11.1.19)
It is to be maximized subject to M constraints $\sum_{k=1}^{M} P_{jk} = 1$, $j = 1, 2, \ldots, M$.
Consider the Lagrangian


$S = \sum_j \sum_k n_{jk} \log P_{jk} - \sum_j \lambda_j \left( \sum_k P_{jk} - 1 \right)$. (11.1.20)

Setting the derivative of S with respect to $P_{jk}$ equal to 0 yields

$n_{jk} = \lambda_j P_{jk}$. (11.1.21)


Summing both sides of (11.1.21) over k and using the constraints, we obtain
the MLE


$\hat P_{jk} = n_{jk} / \sum_{k=1}^{M} n_{jk}$. (11.1.22)


See Anderson and Goodman (1957) for the asymptotic properties of the MLE 
(11.1.22). 
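
The MLE (11.1.22) amounts to counting transitions and normalizing each row. A minimal sketch follows, assuming states are coded 0, 1, . . . , M - 1 and each person's history is a list of states; the function names and the small data set are hypothetical, and the code presumes that every state occurs at least once as an origin so that no row total is zero.

```python
def transition_counts(sequences, m):
    """n_jk: number of moves from state j to state k, pooled over people and time."""
    counts = [[0] * m for _ in range(m)]
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return counts


def transition_mle(counts):
    """Row-normalize the counts as in (11.1.22)."""
    return [[c / sum(row) for c in row] for row in counts]


seqs = [[0, 0, 1, 1, 0], [1, 0, 0, 0, 1], [0, 1, 1, 1, 1]]
counts = transition_counts(seqs, 2)
P_hat = transition_mle(counts)
print(counts)   # [[3, 3], [2, 4]]
print(P_hat)
```

Each row of the estimate sums to one by construction, so the estimate is itself a Markov matrix.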

Anderson and Goodman also discussed the test of various hypotheses in the 
homogeneous stationary Markov model. Suppose we want to test the null 
hypothesis that $P_{jk}$ is equal to a certain (nonzero) specified value $P_{jk}^0$ for
$k = 1, 2, \ldots, M$ and for a particular j. Then, using a derivation similar to
(9.3.24), we can show

$S_j \equiv \left( \sum_{k=1}^{M} n_{jk} \right) \sum_{k=1}^{M} \frac{(\hat P_{jk} - P_{jk}^0)^2}{P_{jk}^0} \sim \chi^2_{M-1}$, (11.1.23)

where $\hat P_{jk}$ is the MLE. Furthermore, if $P_{jk}^0$ is given for $j = 1, 2, \ldots, M$ as well
as k, we can use the test statistic $\sum_{j=1}^{M} S_j$, which is asymptotically distributed as
chi-square with M(M - 1) degrees of freedom. Next, suppose we want to test
(11.1.18) itself against a homogeneous but nonstationary model characterized
by $P_{jk}^i(t) = P_{jk}(t)$. This can be tested by the likelihood ratio test statistic with
the following distribution:


$-2 \log \prod_t \prod_j \prod_k [\hat P_{jk} / \hat P_{jk}(t)]^{n_{jk}(t)} \sim \chi^2_{(T-1)M(M-1)}$, (11.1.24)


where $\hat P_{jk}(t) = n_{jk}(t) / \sum_{k=1}^{M} n_{jk}(t)$.
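
Both statistics are simple functions of the transition counts. The sketch below computes $S_j$ of (11.1.23) from the pooled counts $n_{jk}$ and the likelihood ratio statistic of (11.1.24) from the period-by-period counts $n_{jk}(t)$. The function names, the count tables, and the hypothesized matrix are hypothetical, and cells with zero counts are skipped in the likelihood ratio because they contribute nothing to the likelihood.

```python
import math


def chi_square_stat(counts, P0, j):
    """S_j of (11.1.23) for origin state j."""
    n_j = sum(counts[j])
    return n_j * sum((counts[j][k] / n_j - P0[j][k]) ** 2 / P0[j][k]
                     for k in range(len(P0[j])))


def lr_stationarity_stat(counts_by_period):
    """Statistic of (11.1.24); counts_by_period[t][j][k] holds n_jk(t)."""
    m = len(counts_by_period[0])
    pooled = [[sum(c[j][k] for c in counts_by_period) for k in range(m)]
              for j in range(m)]
    stat = 0.0
    for c in counts_by_period:
        for j in range(m):
            for k in range(m):
                if c[j][k] > 0:
                    p_pooled = pooled[j][k] / sum(pooled[j])
                    p_t = c[j][k] / sum(c[j])
                    stat -= 2.0 * c[j][k] * math.log(p_pooled / p_t)
    return stat


counts = [[30, 10], [8, 12]]
P0 = [[0.5, 0.5], [0.5, 0.5]]                 # hypothesized values
S0 = chi_square_stat(counts, P0, 0)
S1 = chi_square_stat(counts, P0, 1)
lr_same = lr_stationarity_stat([[[15, 5], [4, 6]], [[15, 5], [4, 6]]])
lr_diff = lr_stationarity_stat([[[18, 2], [4, 6]], [[12, 8], [4, 6]]])
print(S0, S1, lr_same, lr_diff)
```

The statistic for the first origin state would be compared with the chi-square distribution with M - 1 = 1 degree of freedom, and the likelihood ratio with (T - 1)M(M - 1) = 2 degrees of freedom in this two-period, two-state example.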
In the same article Anderson and Goodman also discussed a test of the 
first-order assumption against a second-order Markov chain.


11.1.2 Empirical Examples of Markov Models without
Exogenous Variables


In this subsection we shall discuss several empirical articles, in which Markov 
chain models without exogenous variables are estimated, for the purpose of 
illustrating some of the theoretical points discussed in the preceding subsec- 
tion. We shall also consider certain new theoretical problems that are likely to 
be encountered in practice. 

Suppose we assume a homogeneous stationary first-order Markov model
and estimate the Markov matrix (the matrix of transition probabilities) by the
MLE $\hat P_{jk}$ given by (11.1.22). Let $P_{jk}^{(2)}$ be a transition probability of lag two; that
is, $P_{jk}^{(2)}$ denotes the probability a person is in state k at time t given that he or she
was in state j at time t - 2. If we define $n_{jk}^{(2)}(t) = \sum_{i=1}^{N} y_j^i(t - 2) y_k^i(t)$ and
$n_{jk}^{(2)} = \sum_{t=2}^{T} n_{jk}^{(2)}(t)$, a consistent estimate of $P_{jk}^{(2)}$ is also given by
$\hat P_{jk}^{(2)} = n_{jk}^{(2)} / \sum_{k=1}^{M} n_{jk}^{(2)}$.
Now, if our assumption is correct, we should have approximately


$\hat P^{(2)} \cong \hat P^2$, (11.1.25)


where $\hat P^2 = \hat P\hat P$. Many studies of the mobility of people among classes of
income, social status, or occupation have shown the invalidity of the
approximate equality (11.1.25). There is a tendency for the diagonal elements of $\hat P^{(2)}$
to be larger than those of $\hat P^2$ (see, for example, Bartholomew, 1982, Chapter 2).
This phenomenon may be attributable to a number of reasons. Two empirical 
articles have addressed this issue: McCall (1971) explained it by population 
heterogeneity, whereas Shorrocks (1976) attributed the phenomenon in his 
data to a violation of the first-order assumption itself. 
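
The comparison in (11.1.25) is easy to carry out once the lag-one and lag-two transition frequencies are tabulated. The sketch below does this for a simulated population that is heterogeneous by construction: half of the people never change state and the other half draw a new state at random each period. This design, the sample sizes, and all function names are hypothetical and are meant only to show how heterogeneity can make the diagonal of $\hat P^{(2)}$ exceed that of $\hat P^2$.

```python
import random


def lag_transition_estimate(sequences, m, lag):
    """Row-normalized counts of moves from the state at t - lag to the state at t."""
    counts = [[0] * m for _ in range(m)]
    for seq in sequences:
        for t in range(lag, len(seq)):
            counts[seq[t - lag]][seq[t]] += 1
    return [[c / sum(row) for c in row] for row in counts]


def mat_square(a):
    """Ordinary matrix product of a with itself."""
    m = len(a)
    return [[sum(a[i][h] * a[h][j] for h in range(m)) for j in range(m)]
            for i in range(m)]


def simulate_heterogeneous(n, periods, seed=0):
    """Two states; even-numbered people never move, odd-numbered people move at random."""
    rng = random.Random(seed)
    sequences = []
    for i in range(n):
        state = rng.randrange(2)
        seq = [state]
        for _ in range(periods):
            if i % 2 == 1:
                state = rng.randrange(2)
            seq.append(state)
        sequences.append(seq)
    return sequences


seqs = simulate_heterogeneous(2000, 10)
P1 = lag_transition_estimate(seqs, 2, 1)
P2 = lag_transition_estimate(seqs, 2, 2)
P1_sq = mat_square(P1)
for j in range(2):
    print(j, round(P2[j][j], 3), round(P1_sq[j][j], 3))
```

In this design the population values of the diagonal elements are 0.75 for the lag-two matrix and 0.625 for the squared lag-one matrix, so the simulated frequencies should show a clear gap.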

McCall (1971) analyzed a Markov chain model of income mobility, where 
the dependent variable is classified into three states: low income, high income, 
and unknown. McCall estimated a model for each age-sex-race combination
so that he did not need to include the exogenous variables that represent
these characteristics. Using the mover-stayer model (initially proposed by
Blumen, Kogan, and McCarthy, 1955, and theoretically developed by
Goodman, 1961), McCall postulated that a proportion $S_j$ of people, j = 1, 2, and 3,
stay in state j throughout the sample period and the remaining population
follows a nonstationary first-order Markov model. Let $V_{jk}(t)$ be the probability
a mover is in state k at time t given that he or she was in state j at time t - 1.
Then the transition probabilities of a given individual, unidentified to be
either a stayer or a mover, are given by


$P_{jj}(t) = S_j + (1 - S_j)V_{jj}(t)$ and (11.1.26)
$P_{jk}(t) = (1 - S_j)V_{jk}(t)$ if $j \ne k$.
McCall assumed that $V_{jk}(t)$ depends on t (nonstationarity) because of
economic growth, but he got around the nonstationarity problem simply by
estimating $V_{jk}(t)$ for each t separately. He used the simplest among several
methods studied by Goodman (1961). Stayers in state j are identified as people
who remained in state j throughout the sample periods. Once each individual
is identified to be either a stayer or a mover, $S_j$ and $V_{jk}(t)$ can be estimated in a
straightforward manner. This method is good only when there are many
periods. If the number of periods is small (in McCall’s data T = 10), this
method will produce bias (even if n is large) because a proportion of those who
stayed in a single state throughout the sample periods may actually be movers.
After obtaining estimates of $V_{jk}(t)$, McCall regressed them on a variable
representing economic growth to see how economic growth influences income
mobility.

For a stationary Markov model, Goodman discussed several estimates of $S_j$
and $V_{jk}$ that are consistent (as $n$ goes to infinity) even if $T$ is small (provided
$T > 1$). We shall mention only one simple consistent estimator. By defining
matrices $V = \{V_{jk}\}$, $P = \{P_{jk}\}$, and a diagonal matrix $S = D(S_j)$, we can write
(11.1.26) as

$$P = S + (I - S)V. \tag{11.1.27}$$

The matrix of transition probabilities of lag two is given by

$$P^{(2)} = S + (I - S)V^2. \tag{11.1.28}$$

Now, $P$ and $P^{(2)}$ can be consistently estimated by the MLE mentioned earlier.
Inserting the MLE into the left-hand side of (11.1.27) and (11.1.28) gives us
$2M(M - 1)$ equations. But since there are only $M^2$ parameters to estimate in $S$
and $V$, solving $M^2$ equations out of the $2M(M - 1)$ equations for $S$ and $V$ will
yield consistent estimates.

The empirical phenomenon mentioned earlier can be explained by the
mover-stayer model as follows: From (11.1.27) and (11.1.28) we obtain after
some manipulation

$$P^{(2)} - P^2 = S + (I - S)V^2 - [S + (I - S)V][S + (I - S)V] \tag{11.1.29}$$
$$= (I - S)(I - V)S(I - V).$$

Therefore the diagonal elements of $P^{(2)} - P^2$ are positive if the diagonal
elements of $(I - V)S(I - V)$ are positive. But the $j$th diagonal element of
$(I - V)S(I - V)$ is equal to

$$S_j(1 - V_{jj})^2 + \sum_{k \neq j}^{M} S_k V_{jk} V_{kj},$$

which is positive.
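
The identity (11.1.29) can be checked numerically. The following is a minimal sketch in plain Python; the stayer proportions `s` and the mover matrix `v` are made-up illustrative values, not estimates from McCall's data, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def combine(a, b, sign=1.0):
    """Elementwise a + sign * b."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mover_stayer_gap(s, v):
    """Return (P2 - P*P, (I-S)(I-V)S(I-V)) for stayer shares s and mover matrix v."""
    n = len(s)
    eye = [[float(i == j) for j in range(n)] for i in range(n)]
    S = [[s[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    I_S, I_V = combine(eye, S, -1.0), combine(eye, v, -1.0)
    P = combine(S, matmul(I_S, v))                    # (11.1.27)
    P2 = combine(S, matmul(I_S, matmul(v, v)))        # (11.1.28)
    lhs = combine(P2, matmul(P, P), -1.0)             # left side of (11.1.29)
    rhs = matmul(matmul(I_S, I_V), matmul(S, I_V))    # right side of (11.1.29)
    return lhs, rhs

# Hypothetical values chosen only for illustration.
s = [0.3, 0.1, 0.2]
v = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.4, 0.5]]
lhs, rhs = mover_stayer_gap(s, v)
```

With these inputs the two sides agree up to rounding, and every diagonal element of the difference is positive, as the argument in the text requires.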

Shorrocks (1976) accounted for the invalidity of (11.1.25) in a study of
income mobility by postulating a second-order Markov model. Depending on
the initial conditions and the values of the parameters, a second-order Markov
model can lead to a situation where the diagonal elements of $\hat{P}^{(2)}$ are larger
than the corresponding elements of $\hat{P}^2$.

The likelihood function of a second-order Markov model conditional on
the initial values $y_j^i(-1)$ and $y_j^i(0)$ is given by

$$L = \prod_t \prod_i \prod_j \prod_k \prod_l \left[P_{jkl}^i(t)\right]^{y_j^i(t-2)\,y_k^i(t-1)\,y_l^i(t)}, \tag{11.1.30}$$

where $P_{jkl}^i(t)$ is the probability the $i$th person is in state $l$ at time $t$ given that he
or she was in state $j$ at time $t - 2$ and in state $k$ at time $t - 1$. If homogeneity
and stationarity are assumed, then $P_{jkl}^i(t) = P_{jkl}$. Even then the model contains
$M^2(M - 1)$ parameters to estimate. Shorrocks grouped income into five
classes ($M = 5$), thus implying 100 parameters. By assuming the Champernowne
process (Champernowne, 1953), where income mobility at each time
change is restricted to three possibilities (staying in the same class or
moving up or down to an adjacent income class), he reduced the number of
parameters to six, which he estimated by ML. We can see this as follows: Let
$\eta, \xi = -1, 0, 1$ represent the three possible movements. Then the Champernowne
process is characterized by $P_{\eta\xi}$, the probability a person moves by $\xi$
from $t - 1$ to $t$ given that he or she moved by $\eta$ from $t - 2$ to $t - 1$. Notice that
the model is reduced to a first-order Markov chain with $M = 3$.

For the purpose of illustrating the usefulness of equilibrium probabilities in 
an empirical problem and of introducing the problem of entry and exit, let us 
consider a model presented by Adelman (1958). Adelman analyzed the size 
distribution of firms in the steel industry in the United States using the data for 
the periods 1929-1939 and 1945-1956 (excluding the war years). Firms are
grouped into six size classes according to the dollar values of the firms’ total 
assets. In addition, state 0 is affixed to represent the state of being out of the 
industry. Adelman assumed a homogeneous stationary first-order model. 

Movements into and from state 0 are called exit and entry, and they create a
special problem because the number of firms in state 0 at any particular time is
not observable. In our notation it means that $\sum_{l=0}^{6} n_{0l}$ is not observable, and
hence, the MLE $\hat{P}_{0k} = n_{0k} / \sum_l n_{0l}$ cannot be evaluated. Adelman circumvented the
problem by setting $\sum_{l=0}^{6} n_{0l} = 100{,}000$ arbitrarily. We shall see that although
changing $\sum_{l=0}^{6} n_{0l}$ changes the estimates of the transition probabilities, it does
not affect the equilibrium relative size distribution of firms (except relative to
size 0).

Let $p \equiv p(\infty)$ be a vector of equilibrium probabilities. Then, as was shown
earlier, $p$ can be obtained by solving

$$\begin{bmatrix} [I - P']^* \\ l' \end{bmatrix} p = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \tag{11.1.31}$$

where $*$ means eliminating the first row of the matrix. Now, consider $p_j/p_k$ for
$j, k \neq 0$, where $p_j$ and $p_k$ are solved from (11.1.31) by Cramer's rule. Because
$\sum_{l=0}^{6} n_{0l}$ affects only the first column of $[I - P']^*$ proportionally, it does not
affect $p_j/p_k$.
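
The invariance can be illustrated with a small numerical sketch. Everything below is hypothetical: a chain with state 0 and two size classes, invented entry counts, and a plain Gaussian-elimination solver written only for this example. It solves the system (11.1.31) for two different arbitrary pool sizes and compares the ratio of the equilibrium probabilities of the two size classes.

```python
def stationary(P):
    """Solve (11.1.31): rows 1..m-1 of (I - P') p = 0 together with sum(p) = 1."""
    m = len(P)
    A = [[(1.0 if r == c else 0.0) - P[c][r] for c in range(m)] for r in range(1, m)]
    A.append([1.0] * m)                       # adding-up constraint l'p = 1
    b = [0.0] * (m - 1) + [1.0]
    for col in range(m):                      # Gauss-Jordan with partial pivoting
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(m):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return [b[i] / A[i][i] for i in range(m)]

def transition_matrix(entries, pool, P_rest):
    """Row 0 divides the observed entry counts by an arbitrary pool size for state 0."""
    row0 = [1.0 - sum(entries) / pool] + [e / pool for e in entries]
    return [row0] + P_rest

# Hypothetical rows for size classes 1 and 2 (columns: exit, class 1, class 2).
P_rest = [[0.10, 0.70, 0.20],
          [0.05, 0.15, 0.80]]
entries = [30, 10]                            # hypothetical entrants into classes 1 and 2
p_a = stationary(transition_matrix(entries, 1_000, P_rest))
p_b = stationary(transition_matrix(entries, 100_000, P_rest))
ratio_a = p_a[1] / p_a[2]
ratio_b = p_b[1] / p_b[2]
```

The equilibrium probability of state 0 changes with the pool size, but `ratio_a` and `ratio_b` coincide up to rounding, which is the point of the Cramer's rule argument.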

Duncan and Lin (1972) criticized Adelman’s model, saying that it is unreal- 
istic to suppose a homogeneous pool of firms in state 0 because a firm that 
once goes out of business is not likely to come back. Duncan and Lin solved 
the problem by treating entry and exit separately. Exit is assumed to be an 
absorbing state. Suppose $j = 1$ is an absorbing state; then $P_{11} = 1$ and $P_{1k} = 0$
for $k = 2, 3, \ldots, M$. Entry into state $k$, by $m_k(t)$ firms at time $t$,
$k = 2, 3, \ldots, M$, is assumed to follow a Poisson distribution:

$$F_k(t) = \mu_k^{m_k(t)} e^{-\mu_k} [m_k(t)!]^{-1}. \tag{11.1.32}$$

Then $P_{jk}$ and $\mu_k$ are estimated by maximizing the likelihood function

$$L^* = L \cdot \prod_{t=1}^{T} \prod_{k=2}^{M} F_k(t), \tag{11.1.33}$$


where L is as given in (11.1.4). This model is applied to data on five classes of 
banks according to the ratio of farm loans to total loans. Maximum likelihood 
estimates are obtained, and a test of stationarity is performed following the 
Anderson-Goodman methodology. 


11.1.3 Two-State Models with Exogenous Variables 


We shall consider a two-state Markov model with exogenous variables, which 
accounts for the heterogeneity and nonstationarity of the data. This model is 
closely related to the models considered in Section 9.7.2. We shall also discuss 
an example of the model, attributed to Boskin and Nold (1975), to illustrate 
several important points. 
This model is similar to a univariate binary QR model. To make the subsequent
discussion comparable to the discussion of Chapter 9, assume that $j = 0$
or 1 rather than 1 or 2. Let $y_{it} = 1$ if the $i$th person is in state 1 at time $t$ and
$y_{it} = 0$ otherwise. (Note that $y_{it}$ is the same as the $y_1^i(t)$ we wrote earlier.) The
model then can be written as

$$P(y_{it} = 1 \mid y_{i,t-1}) = F(\beta' x_{it} + \alpha' x_{it}\, y_{i,t-1}), \tag{11.1.34}$$

where $F$ is a certain distribution function. Note that (11.1.34) is equivalent to

$$P_{01}^i(t) = F(\beta' x_{it}) \tag{11.1.35}$$
$$P_{11}^i(t) = F[(\alpha + \beta)' x_{it}].$$

Writing the model as (11.1.34) rather than as (11.1.35) makes clearer the
similarity of this model to a QR model. The model defined by (9.7.11) is a
special case of (11.1.34) if $\{v_{it}\}$ in (9.7.11) are i.i.d.
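Using (11.1.4), we can write the likelihood function of the model conditional
on $y_{i0}$ as

$$L = \prod_i \prod_t \left[P_{11}^i(t)\right]^{y_{i,t-1} y_{it}} \left[P_{10}^i(t)\right]^{y_{i,t-1}(1 - y_{it})} \tag{11.1.36}$$
$$\times \left[P_{01}^i(t)\right]^{(1 - y_{i,t-1}) y_{it}} \left[P_{00}^i(t)\right]^{(1 - y_{i,t-1})(1 - y_{it})}.$$

However, the similarity to a QR model becomes clearer if we write it alternatively
as

$$L = \prod_i \prod_t F_{it}^{y_{it}} (1 - F_{it})^{1 - y_{it}}, \tag{11.1.37}$$

where $F_{it} = F(\beta' x_{it} + \alpha' x_{it}\, y_{i,t-1})$.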

Because we can treat $(i, t)$ as a single index, model (11.1.34) differs from a
binary QR model only by the presence of $y_{i,t-1}$ in the argument of $F$. But, as
mentioned earlier, its presence causes no more difficulty than it does in the
continuous-variable autoregressive model; we can treat $y_{i,t-1}$ as if it were an
exogenous variable so far as the asymptotic results are concerned. Thus, from
(9.2.17) we can conclude that the MLE of $\gamma = (\beta', \alpha')'$ follows

$$\sqrt{NT}\,(\hat{\gamma} - \gamma) \to N\left\{0, \left[\operatorname{plim} \frac{1}{NT} \sum_i \sum_t \frac{f_{it}^2}{F_{it}(1 - F_{it})}\, w_{it} w_{it}'\right]^{-1}\right\}, \tag{11.1.38}$$

where $f_{it}$ is the derivative of $F_{it}$ and

$$w_{it} = \begin{bmatrix} x_{it} \\ x_{it}\, y_{i,t-1} \end{bmatrix}.$$

The equivalence of the NLWLS iteration to the method of scoring also holds
for this model.
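
As an illustration of (11.1.37), the log likelihood can be evaluated directly from a panel of 0/1 outcomes. The sketch below assumes a logistic $F$, which is only one admissible choice, and uses hypothetical data structures (`y[i][t]` for outcomes, `x[i][t]` for a list of regressors); it conditions on the initial observation `y[i][0]`.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta, alpha, y, x):
    """Log of (11.1.37) with F logistic; y[i][t] for t = 0..T, x[i][0] is unused."""
    total = 0.0
    for yi, xi in zip(y, x):
        for t in range(1, len(yi)):
            index = sum(b * v for b, v in zip(beta, xi[t]))
            # The lagged state switches on the alpha'x term, as in (11.1.34).
            index += yi[t - 1] * sum(a * v for a, v in zip(alpha, xi[t]))
            F = logistic(index)
            total += math.log(F) if yi[t] == 1 else math.log(1.0 - F)
    return total

# Tiny made-up panel: two individuals, three periods, a constant regressor.
y = [[1, 1, 0], [0, 1, 1]]
x = [[[1.0], [1.0], [1.0]], [[1.0], [1.0], [1.0]]]
ll_null = log_likelihood([0.0], [0.0], y, x)
expected = 4 * math.log(0.5)     # four transitions, each with probability 1/2
```

Because $(i, t)$ is treated as a single index, the function is an ordinary binary-response log likelihood whose regressors happen to include the lagged outcome.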

Similarly, the minimum chi-square estimator can also be defined for this 
model as in Section 9.2.5. It is applicable when there are many observations of 
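$y_{it}$ with the same value of $x_{it}$. Although a more general grouping can be
handled by the subsequent analysis, we assume $x_{it} = x_t$ for every $i$, so that $y_{it}$,
$i = 1, 2, \ldots, N$, are associated with the same vector $x_t$.

Define

$$\hat{P}_t^0 = \frac{\sum_{i=1}^{N} y_{it}(1 - y_{i,t-1})}{\sum_{i=1}^{N} (1 - y_{i,t-1})} \tag{11.1.39}$$

and

$$\hat{P}_t^1 = \frac{\sum_{i=1}^{N} y_{it}\, y_{i,t-1}}{\sum_{i=1}^{N} y_{i,t-1}}. \tag{11.1.40}$$

Then we have

$$F^{-1}(\hat{P}_t^0) = \beta' x_t + \xi_t, \quad t = 1, 2, \ldots, T \tag{11.1.41}$$

and

$$F^{-1}(\hat{P}_t^1) = (\alpha + \beta)' x_t + \eta_t, \quad t = 1, 2, \ldots, T. \tag{11.1.42}$$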
The error terms $\xi_t$ and $\eta_t$ approximately have zero means, and their respective
conditional variances given $y_{i,t-1}$, $i = 1, 2, \ldots, N$, are approximately

$$V(\xi_t) = \frac{P_t^0(1 - P_t^0)}{f^2[F^{-1}(P_t^0)] \sum_{i=1}^{N} (1 - y_{i,t-1})} \tag{11.1.43}$$

and

$$V(\eta_t) = \frac{P_t^1(1 - P_t^1)}{f^2[F^{-1}(P_t^1)] \sum_{i=1}^{N} y_{i,t-1}}, \tag{11.1.44}$$

where $P_t^0 = P_{01}^i(t)$ and $P_t^1 = P_{11}^i(t)$. The MIN $\chi^2$ estimator of $\gamma$ is the weighted
least squares estimator applied simultaneously to the heteroscedastic regression
equations (11.1.41) and (11.1.42). As in QR models, this estimator has
the same asymptotic distribution as the MLE given in (11.1.38), provided that
$N$ goes to $\infty$ for a fixed $T$.
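
The grouped proportions in (11.1.39) and (11.1.40) are simple to compute from a panel. The sketch below uses a made-up panel and, as an assumption, takes $F$ to be logistic so that $F^{-1}$ is the logit transform; the function names are hypothetical.

```python
import math

def grouped_proportions(y, t):
    """Return (P0_hat, P1_hat) of (11.1.39)-(11.1.40) at time t from panel y[i][t].

    Fails with ZeroDivisionError if nobody was in state 0 (or state 1) at t - 1.
    """
    was0 = [yi for yi in y if yi[t - 1] == 0]
    was1 = [yi for yi in y if yi[t - 1] == 1]
    p0 = sum(yi[t] for yi in was0) / len(was0)
    p1 = sum(yi[t] for yi in was1) / len(was1)
    return p0, p1

def logit(p):
    """Inverse of the logistic distribution function (assumed form of F)."""
    return math.log(p / (1.0 - p))

# Six hypothetical individuals observed at t = 0 and t = 1.
y = [[0, 1], [0, 0], [1, 1], [1, 1], [1, 0], [0, 0]]
p0_hat, p1_hat = grouped_proportions(y, 1)
lhs_41, lhs_42 = logit(p0_hat), logit(p1_hat)   # left sides of (11.1.41), (11.1.42)
```

The transformed proportions `lhs_41` and `lhs_42` are the dependent variables of the two heteroscedastic regressions to which weighted least squares is applied.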

To illustrate further the specific features of the two-state Markov model, we 
shall examine a model presented by Boskin and Nold (1975). In this model, 
$j = 1$ represents the state of an individual being on welfare and $j = 0$ the state
of the individual being off welfare. Boskin and Nold postulated

$$P_{10}^i(t) = \Lambda(\alpha' x_i) \equiv A_i \tag{11.1.45}$$
$$P_{01}^i(t) = \Lambda(\beta' x_i) \equiv B_i,$$

where $\Lambda(x) = (1 + e^{-x})^{-1}$, a logistic distribution. The model can be equivalently
defined by

$$P(y_{it} = 1 \mid y_{i,t-1}) = \Lambda[\beta' x_i - (\alpha + \beta)' x_i\, y_{i,t-1}]. \tag{11.1.46}$$

Thus we see that the Boskin-Nold model is a special case of (11.1.34).

Note that the exogenous variables in the Boskin-Nold model do not depend 
on t, a condition indicating that their model is heterogeneous but stationary. 
The exogenous variables are dummy variables characterizing the economic 
and demographic characteristics of the individuals. The model is applied to 
data on 440 households (all of which were on welfare initially) during a 
60-month period (thus $t$ denotes a month).

Because of stationarity the likelihood function (11.1.36) can be simplified 
for this model as 


$$L = \prod_i A_i^{n_{10}^i} (1 - A_i)^{n_{11}^i} B_i^{n_{01}^i} (1 - B_i)^{n_{00}^i}. \tag{11.1.47}$$


It is interesting to evaluate the equilibrium probability, that is, the probability
that a given individual is on welfare after a sufficiently long time has elapsed
since the initial time. By considering a particular individual and therefore
omitting the subscript $i$, the Markov matrix of the Boskin-Nold model can be
written as

$$P' = \begin{bmatrix} 1 - A & B \\ A & 1 - B \end{bmatrix}. \tag{11.1.48}$$

Solving (11.1.15) together with the constraint $l'p(\infty) = 1$ yields

$$p(\infty) = \left[\frac{B}{A + B},\ \frac{A}{A + B}\right]'. \tag{11.1.49}$$

The first component of the vector signifies the equilibrium probability that the
person is on welfare. It increases with the transition probability
$P_{01}$ ($\equiv B$).

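It is instructive to derive the characteristic roots and vectors of $P'$ given in
(11.1.48). Solving the determinantal equation

$$\det \begin{bmatrix} 1 - A - \lambda & B \\ A & 1 - B - \lambda \end{bmatrix} = 0$$

yields the characteristic roots $\lambda_1 = 1$ and $\lambda_2 = 1 - A - B$. Solving

$$\begin{bmatrix} -A & B \\ A & -B \end{bmatrix} h_1 = 0$$

yields (the solution is not unique) the first characteristic vector $h_1 = (B, A)'$.
Solving

$$\begin{bmatrix} B & B \\ A & A \end{bmatrix} h_2 = 0$$

yields the second characteristic vector $h_2 = (-1, 1)'$. Therefore

$$H = \begin{bmatrix} B & -1 \\ A & 1 \end{bmatrix} \quad \text{and} \quad H^{-1} = \frac{1}{A + B} \begin{bmatrix} 1 & 1 \\ -A & B \end{bmatrix}.$$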

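Using these results, we obtain

$$(P')^{\infty} = \lim_{t \to \infty} H \begin{bmatrix} 1 & 0 \\ 0 & \lambda_2^t \end{bmatrix} H^{-1} \tag{11.1.50}$$
$$= H \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} H^{-1}$$
$$= \begin{bmatrix} \dfrac{B}{A + B} & \dfrac{B}{A + B} \\[2ex] \dfrac{A}{A + B} & \dfrac{A}{A + B} \end{bmatrix},$$

which, of course, could have been obtained directly from (11.1.16) and
(11.1.49).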
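
The limit in (11.1.50) can be confirmed by brute force. The sketch below raises $P'$ to a high power by repeated multiplication; the values of `A` and `B` are hypothetical monthly transition probabilities chosen only for illustration.

```python
def transpose_markov(A, B):
    """P' of (11.1.48): columns are 'from on welfare' and 'from off welfare'."""
    return [[1.0 - A, B], [A, 1.0 - B]]

def power(M, t):
    """M^t for a 2x2 matrix by repeated multiplication."""
    R = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(t):
        R = [[sum(R[i][k] * M[k][j] for k in range(2)) for j in range(2)]
             for i in range(2)]
    return R

A, B = 0.05, 0.02                      # hypothetical values
limit = power(transpose_markov(A, B), 2000)
on_welfare = B / (A + B)               # first component of (11.1.49)
```

Both entries in the first row of `limit` approach `on_welfare` and both entries in the second row approach `A / (A + B)`, because the second characteristic root $1 - A - B$ is less than one in absolute value and its powers vanish.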

Although the asymptotic variance-covariance matrix of the MLE $\hat{\alpha}$ and $\hat{\beta}$ in
the Boskin-Nold model can be derived from (11.1.38), we can derive it directly
using the likelihood function (11.1.47). We shall do so only for $\hat{\alpha}$ because the
derivation for $\hat{\beta}$ is similar. We shall assume $T \to \infty$, for this assumption enables
us to obtain a simple formula.
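We need to consider only the part of (11.1.47) that involves $A_i$; its logarithm
is

$$\log L = \sum_i n_{10}^i \log A_i + \sum_i n_{11}^i \log (1 - A_i). \tag{11.1.51}$$

Differentiating (11.1.51) with respect to $\alpha$, we obtain

$$\frac{\partial \log L}{\partial \alpha} = \sum_i \frac{n_{10}^i}{A_i} \frac{\partial A_i}{\partial \alpha} - \sum_i \frac{n_{11}^i}{1 - A_i} \frac{\partial A_i}{\partial \alpha} \tag{11.1.52}$$

and

$$\frac{\partial^2 \log L}{\partial \alpha\, \partial \alpha'} = -\sum_i \left[\frac{n_{10}^i}{A_i^2} + \frac{n_{11}^i}{(1 - A_i)^2}\right] \frac{\partial A_i}{\partial \alpha} \frac{\partial A_i}{\partial \alpha'} \tag{11.1.53}$$
$$+ \sum_i \left[\frac{n_{10}^i}{A_i} - \frac{n_{11}^i}{1 - A_i}\right] \frac{\partial^2 A_i}{\partial \alpha\, \partial \alpha'}.$$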

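To take the expectation of (11.1.53), we need to evaluate $E n_{10}^i$ and $E n_{11}^i$. We
have

$$E n_{10}^i = E \sum_{t=1}^{T} y_1^i(t - 1)\, y_0^i(t) \tag{11.1.54}$$
$$= \left[E \sum_{t=0}^{T-1} y_1^i(t)\right] P_{10}^i$$
$$= (1, 0) \left[\sum_{t=0}^{T-1} (P')^t\right] p^i(0)\, P_{10}^i.$$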

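If we define $D$ to be the $2 \times 2$ diagonal matrix consisting of the characteristic
roots of $P'$, we have from (11.1.50)

$$\sum_{t=0}^{T-1} (P')^t = H \left[\sum_{t=0}^{T-1} D^t\right] H^{-1} \tag{11.1.55}$$
$$\cong \frac{T}{A + B} \begin{bmatrix} B & B \\ A & A \end{bmatrix},$$

where the approximation is valid for large $T$. Inserting (11.1.55) into (11.1.54)
yields

$$E n_{10}^i \cong \frac{T A_i B_i}{A_i + B_i}. \tag{11.1.56}$$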
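Similarly,

$$E n_{11}^i \cong \frac{T (1 - A_i) B_i}{A_i + B_i}. \tag{11.1.57}$$

From (11.1.53), (11.1.56), and (11.1.57) we obtain

$$E \frac{\partial^2 \log L}{\partial \alpha\, \partial \alpha'} \cong -T \sum_i \frac{A_i (1 - A_i) B_i}{A_i + B_i}\, x_i x_i'. \tag{11.1.58}$$

Hence,¹

$$AV(\hat{\alpha}) = \left[T \sum_i \frac{A_i (1 - A_i) B_i}{A_i + B_i}\, x_i x_i'\right]^{-1}. \tag{11.1.59}$$

Similarly,

$$AV(\hat{\beta}) = \left[T \sum_i \frac{B_i (1 - B_i) A_i}{A_i + B_i}\, x_i x_i'\right]^{-1}. \tag{11.1.60}$$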

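Next, let us consider the consistency of the estimator of $\alpha$ and $\beta$ obtained by
maximizing $L_2$ defined in (11.1.17). Using (11.1.50), we can write the logarithm
of (11.1.17) in the Boskin-Nold model as

$$\log L_2 = \sum_i \left[\sum_{t=0}^{T} y_1^i(t)\right] \log [B_i/(A_i + B_i)] \tag{11.1.61}$$
$$+ \sum_i \left[\sum_{t=0}^{T} y_0^i(t)\right] \log [A_i/(A_i + B_i)].$$

Defining $y^i(t) = [y_1^i(t), y_0^i(t)]'$, we have

$$\operatorname{plim}_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} y^i(t) = \operatorname{plim}_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} (P')^t y^i(0) \tag{11.1.62}$$
$$= \frac{1}{A_{i0} + B_{i0}} \begin{bmatrix} B_{i0} \\ A_{i0} \end{bmatrix},$$

where the subscript 0 denotes the true value. Therefore we obtain

$$\operatorname{plim}_{T \to \infty} \frac{1}{T} \log L_2 = \sum_{i=1}^{N} \left[\frac{B_{i0}}{A_{i0} + B_{i0}} \log \left(\frac{B_i}{A_i + B_i}\right) + \frac{A_{i0}}{A_{i0} + B_{i0}} \log \left(\frac{A_i}{A_i + B_i}\right)\right], \tag{11.1.63}$$

from which we can conclude the consistency of the estimator.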
We can now introduce in the context of the Boskin-Nold model an important
concept in the Markov model called duration. Let $t_1$ be the number of
months a particular individual stays on welfare starting from the beginning of
the sample period. It is called the duration of the first spell of welfare. Its
probability is defined as the probability that the individual stays on welfare up
to time $t_1 - 1$ and then moves off welfare at time $t_1$. Therefore

$$P(t_1) = (1 - A)^{t_1 - 1} A. \tag{11.1.64}$$

The mean duration can be evaluated as

$$E t_1 = \sum_{t=1}^{\infty} t (1 - A)^{t-1} A = A \frac{\partial}{\partial r} \sum_{t=1}^{\infty} r^t \tag{11.1.65}$$
$$= A \frac{\partial}{\partial r} \left(\frac{r}{1 - r}\right) = \frac{A}{(1 - r)^2} = \frac{1}{A},$$

where $r = 1 - A$. In words, this equation says that the mean duration on
welfare is the inverse of the probability of moving off welfare.
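
The result (11.1.65) is easy to check by summing the series directly. The sketch below truncates the infinite sum at a large horizon, which is an approximation introduced only for the illustration; the exit probabilities are hypothetical.

```python
def mean_duration(A, horizon=10_000):
    """Truncated version of the sum in (11.1.65): sum of t * (1 - A)^(t-1) * A."""
    return sum(t * (1.0 - A) ** (t - 1) * A for t in range(1, horizon + 1))

monthly_exit_prob = 0.05                 # hypothetical value of A
months_on_welfare = mean_duration(monthly_exit_prob)
```

For an exit probability of 0.05 per month, the truncated sum agrees with $1/A = 20$ months to many decimal places, since the omitted tail is negligible.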

Suppose the $i$th person experiences $H$ welfare spells of duration
$t_1^i, t_2^i, \ldots, t_H^i$ and $K$ off-welfare spells of duration $s_1^i, s_2^i, \ldots, s_K^i$. If we
generalize the Boskin-Nold model and let the $i$th person's transition probabilities
$A_i$ and $B_i$ vary with spells (but stay constant during each spell), the
likelihood function can be written as

$$L = \prod_i \left\{ \prod_{h=1}^{H} [1 - A_i(h)]^{t_h^i - 1} A_i(h) \prod_{k=1}^{K} [1 - B_i(k)]^{s_k^i - 1} B_i(k) \right\}. \tag{11.1.66}$$

Equation (11.1.66) collapses to (11.1.47) if $A_i$ and $B_i$ do not depend on $h$ and $k$
and therefore is of intermediate generality between (11.1.36) and (11.1.47).
Expressing the likelihood function in terms of duration is especially useful in
the continuous-time Markov model, which we shall study in Section 11.2.


11.1.4 Multistate Models with Exogenous Variables 


Theoretically, not much need be said about this model beyond what we have 
discussed in Section 11.1.1 for the general case and in Section 11.1.3 for the 
two-state case. The likelihood function can be derived from (11.1.4) by specifying
$P_{jk}^i(t)$ as a function of exogenous variables and parameters. The equivalence
of the NLWLS to the method of scoring iteration was discussed for the
general case in Section 11.1.1, and the minimum chi-square estimator defined
for the two-state case in Section 11.1.3 can be straightforwardly generalized to
the multistate case. Therefore it should be sufficient to discuss an empirical
article by Toikka (1976) as an illustration of the NLWLS (which in his case is a
linear WLS estimator because of his linear probability specification).

Toikka’s model is a three-state Markov model of labor market decisions in 
which the three states (corresponding to j = 1, 2, and 3) are the state of being 
employed, the state of being in the labor force (actively looking for a job) and 
unemployed, and the state of being out of the labor force. 

The exogenous variables used by Toikka consist of average (over individ- 
uals) income, average wage rate, and seasonal dummies, all of which depend 
on time (months) but not on individuals. Thus Toikka’s model is a homoge- 
neous and nonstationary Markov model. Moreover, Toikka assumed that 
transition probabilities depend linearly on the exogenous variables.² Thus, in
his model, Eq. (11.1.7) can be written as

$$y^i(t)' = [y^i(t - 1)' \otimes x_t']\,\Pi + u^i(t)', \tag{11.1.67}$$

which is a multivariate heteroscedastic linear regression equation. As we
indicated in Section 11.1.1, the generalized least squares estimator of $\Pi$ is
asymptotically efficient.

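Let $Y_t$ be the $N \times M$ matrix the $i$th row of which is $y^i(t)'$ and let $\bar{Y}_t$ be the
$N \times (M - 1)$ matrix consisting of the first $M - 1$ columns of $Y_t$. Define $\bar{U}_t$
similarly. Then we can write (11.1.67) as

$$\begin{bmatrix} \bar{Y}_1 \\ \bar{Y}_2 \\ \vdots \\ \bar{Y}_T \end{bmatrix} = \begin{bmatrix} Y_0 \otimes x_1' \\ Y_1 \otimes x_2' \\ \vdots \\ Y_{T-1} \otimes x_T' \end{bmatrix} \Pi + \begin{bmatrix} \bar{U}_1 \\ \bar{U}_2 \\ \vdots \\ \bar{U}_T \end{bmatrix}. \tag{11.1.68}$$

The LS estimator $\hat{\Pi}$ is therefore given by

$$\hat{\Pi} = \left[\sum_{t=1}^{T} (Y_{t-1}' Y_{t-1} \otimes x_t x_t')\right]^{-1} \sum_{t=1}^{T} (Y_{t-1}' \bar{Y}_t \otimes x_t). \tag{11.1.69}$$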
To define the FGLS estimator of $\Pi$, write (11.1.68) as $\bar{Y} = X\Pi + \bar{U}$ and
write the columns of $\bar{Y}$, $\Pi$, and $\bar{U}$ explicitly as $\bar{Y} = [y_1, y_2, \ldots, y_G]$,
$\Pi = [\pi_1, \pi_2, \ldots, \pi_G]$, and $\bar{U} = [u_1, u_2, \ldots, u_G]$, where $G = M - 1$.
Also define $y = (y_1', y_2', \ldots, y_G')'$, $\pi = (\pi_1', \pi_2', \ldots, \pi_G')'$, and
$u = (u_1', u_2', \ldots, u_G')'$. Then (11.1.68) can be written as

$$y = \underline{X}\pi + u, \tag{11.1.70}$$

where $\underline{X} = I_G \otimes X$. The FGLS estimator of $\pi$ is
$(\underline{X}'\hat{\Omega}^{-1}\underline{X})^{-1}\underline{X}'\hat{\Omega}^{-1}y$, where $\hat{\Omega}$ is a consistent
estimator of $\Omega = Euu'$. Here, $\Omega$ has the following form:

$$\Omega = \begin{bmatrix} D_{11} & D_{12} & \cdots & D_{1G} \\ D_{21} & D_{22} & \cdots & D_{2G} \\ \vdots & & & \vdots \\ D_{G1} & D_{G2} & \cdots & D_{GG} \end{bmatrix}, \tag{11.1.71}$$

where each $D_{jk}$ is a diagonal matrix of size $NT$. If each $D_{jk}$ were a constant
times the identity matrix, (11.1.70) would be Zellner's seemingly unrelated
regression model (see Section 6.4), and therefore the LS estimator would be
asymptotically efficient. In fact, however, the diagonal elements of $D_{jk}$ are not
constant.

Toikka's estimator of $\Pi$, denoted $\tilde{\Pi}$, is defined by

$$\tilde{\Pi} = \left[\sum_{t=1}^{T} (I_M \otimes x_t x_t')\right]^{-1} \sum_{t=1}^{T} \left[(Y_{t-1}' Y_{t-1})^{-1} Y_{t-1}' \bar{Y}_t \otimes x_t\right]. \tag{11.1.72}$$

Because $(Y_{t-1}' Y_{t-1})^{-1} Y_{t-1}' \bar{Y}_t$ is the first $M - 1$ columns of the unconstrained
MLE of the Markov matrix $P(t)$, Toikka's estimator can be interpreted as the
LS estimator in the regression of $\hat{P}(t)$ on $x_t$. Although this idea may seem
intuitively appealing, Toikka's estimator is asymptotically neither more nor
less efficient than $\hat{\Pi}$. Alternatively, Toikka's estimator can be interpreted as
premultiplying (11.1.68) by the block-diagonal matrix

$$\begin{bmatrix} (Y_0' Y_0)^{-1} Y_0' & & & \\ & (Y_1' Y_1)^{-1} Y_1' & & \\ & & \ddots & \\ & & & (Y_{T-1}' Y_{T-1})^{-1} Y_{T-1}' \end{bmatrix}$$

and then applying least squares. If, instead, generalized least squares were
applied in the last stage, the resulting estimator of $\Pi$ would be identical with
the GLS estimator of $\Pi$ derived from (11.1.68).


11.1.5 Estimation Using Aggregate Data 


Up to now we have assumed that a complete history of each individual, $y_j^i(t)$
for every $i$, $t$, and $j$, is observed. In this subsection we shall assume that only the
aggregate data $n_j(t) \equiv \sum_{i=1}^{N} y_j^i(t)$ are available. We shall first discuss LS and
GLS estimators and then we shall discuss MLE briefly.

Suppose the Markov matrix $P^i(t)$ is constant across $i$, so that $P^i(t) = P(t)$.
Summing both sides of (11.1.7) over $i$ yields

$$\sum_i y^i(t) = P(t)' \sum_i y^i(t - 1) + \sum_i u^i(t). \tag{11.1.73}$$

The conditional covariance matrix of the error term is given by

$$V\left[\sum_i u^i(t) \,\Big|\, \sum_i y^i(t - 1)\right] = \sum_i [D(\mu_i) - \mu_i \mu_i'], \tag{11.1.74}$$

where $\mu_i = P(t)' y^i(t - 1)$. Depending on whether $P(t)$ depends on unknown
parameters linearly or nonlinearly, (11.1.73) defines a multivariate heteroscedastic
linear or nonlinear regression model. The parameters can be estimated
by either LS or GLS (NLLS or NLGLS in the nonlinear case), and it should be
straightforward to prove their consistency and asymptotic normality as $NT$
goes to $\infty$.

The simplest case occurs when $P^i(t)$ is constant across both $i$ and $t$, so that
$P^i(t) = P$. If, moreover, $P$ is unconstrained with $M(M - 1)$ free parameters,
(11.1.73) becomes a multivariate linear regression model. We can apply either
a LS or a GLS method, the latter being asymptotically more efficient, as was
explained in Section 11.1.4. See the article by Telser (1963) for an application
of the LS estimator to an analysis of the market shares of three major cigarette
brands in the period 1925-1943.

If P(t) varies with t, the ensuing model is generally nonlinear in parameters, 
except in Toikka's model. As we can see from (11.1.68), the estimation on the 
basis of aggregate data is possible by LS or GLS in Toikka's model, using the 
equation obtained by summing the rows of each $Y_t$. A discussion of models
where the elements of $P(t)$ are nonlinear functions of exogenous variables and
parameters can be found in an article by MacRae (1977). MacRae also discussed
maximum likelihood estimation and estimation based on an incomplete
sample.

In the remainder of this subsection, we shall again consider a two-state 
Markov model (see Section 11.1.3). We shall present some of the results we 
have mentioned so far in more precise terms and shall derive the likelihood 
function. 

Using the same notation as that given in Section 11.1.3, we define $r_t = \sum_{i=1}^{N} y_{it}$.
The conditional mean and variance of $r_t$ given $r_{t-1}$ are given by

$$E r_t = \sum_{i=1}^{N} F_{it} = P_t^1 r_{t-1} + P_t^0 (N - r_{t-1}) \tag{11.1.75}$$

and

$$V r_t = \sum_{i=1}^{N} F_{it}(1 - F_{it}) = P_t^1 (1 - P_t^1) r_{t-1} + P_t^0 (1 - P_t^0)(N - r_{t-1}). \tag{11.1.76}$$
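The NLWLS estimation of $\gamma$ is defined as that which minimizes

$$\sum_{t=1}^{T} \frac{\left(r_t - \sum_{i=1}^{N} F_{it}\right)^2}{\widehat{V r_t}}, \tag{11.1.77}$$

where $\widehat{V r_t}$ is obtained by estimating $P_t^1$ and $P_t^0$ by $F(\hat{\beta}' x_t + \hat{\alpha}' x_t)$ and $F(\hat{\beta}' x_t)$,
respectively, where $\hat{\alpha}$ and $\hat{\beta}$ are consistent estimates obtained, for example, by
minimizing $\sum_{t=1}^{T} (r_t - \sum_{i=1}^{N} F_{it})^2$. Alternatively we can minimize

$$\sum_{t=1}^{T} \frac{\left(r_t - \sum_{i=1}^{N} F_{it}\right)^2}{V r_t} + \sum_{t=1}^{T} \log \sum_{i=1}^{N} F_{it}(1 - F_{it}), \tag{11.1.78}$$

which will asymptotically give the same estimator.³ Let $\tilde{\gamma}$ be the estimator
obtained by minimizing either (11.1.77) or (11.1.78). Then we have

$$\sqrt{NT}\,(\tilde{\gamma} - \gamma) \to N\left\{0, \left[\operatorname{plim} \frac{1}{NT} \sum_{t=1}^{T} \frac{\left(\sum_{i=1}^{N} \dfrac{\partial F_{it}}{\partial \gamma}\right)\left(\sum_{i=1}^{N} \dfrac{\partial F_{it}}{\partial \gamma'}\right)}{\sum_{i=1}^{N} F_{it}(1 - F_{it})}\right]^{-1}\right\}. \tag{11.1.79}$$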


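The asymptotic variance-covariance matrix of $\tilde{\gamma}$ can be shown to be larger
(in matrix sense) than that of $\hat{\gamma}$ given in (11.1.38) as follows: The inverse of the
latter can be also written as

$$\operatorname{plim} \frac{1}{NT} \sum_{t=1}^{T} \sum_{i=1}^{N} \frac{1}{F_{it}(1 - F_{it})} \frac{\partial F_{it}}{\partial \gamma} \frac{\partial F_{it}}{\partial \gamma'}. \tag{11.1.80}$$

Put $z_{it} = [F_{it}(1 - F_{it})]^{-1/2}\, \partial F_{it}/\partial \gamma$ and $a_{it} = [F_{it}(1 - F_{it})]^{1/2}$. Then the desired
inequality follows from

$$\sum_{i=1}^{N} z_{it} z_{it}' \geq \frac{\left(\sum_{i=1}^{N} a_{it} z_{it}\right)\left(\sum_{i=1}^{N} a_{it} z_{it}'\right)}{\sum_{i=1}^{N} a_{it}^2}. \tag{11.1.81}$$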
Finally, we can define the MLE, which maximizes the joint probability of $r_t$
and $r_{t-1}$, $t = 1, 2, \ldots, T$. The joint probability can be shown to be

$$\prod_{t=1}^{T} \sum_{l = \max[r_t - r_{t-1},\, 0]}^{\min[N - r_{t-1},\, r_t]} \binom{r_{t-1}}{r_t - l} (P_t^1)^{r_t - l} (1 - P_t^1)^{r_{t-1} - r_t + l} \tag{11.1.82}$$
$$\times \binom{N - r_{t-1}}{l} (P_t^0)^{l} (1 - P_t^0)^{N - r_{t-1} - l}.$$

The maximization of (11.1.82) is probably too difficult to make this estimator 
of any practical value. Thus we must use the NLLS or NLWLS estimator if 
only aggregate observations are available, even though the MLE is asymptoti- 
cally more efficient, as we can conclude from the study of Barankin and 
Gurland (1951). 
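
One factor of the joint probability can be computed as a convolution of two binomial distributions: the number who remain in state 1 and the number who enter it from state 0. The sketch below is an illustration under that reading, with hypothetical values of $N$, $r_{t-1}$, $P_t^1$, and $P_t^0$; it also recovers the conditional mean (11.1.75) and variance (11.1.76).

```python
from math import comb

def aggregate_transition_prob(r, r_prev, N, p1, p0):
    """P(r_t = r | r_{t-1} = r_prev); l counts people moving from state 0 to state 1."""
    total = 0.0
    for l in range(max(r - r_prev, 0), min(N - r_prev, r) + 1):
        enter = comb(N - r_prev, l) * p0**l * (1 - p0) ** (N - r_prev - l)
        stay = comb(r_prev, r - l) * p1 ** (r - l) * (1 - p1) ** (r_prev - r + l)
        total += enter * stay
    return total

# Hypothetical values for illustration.
N, r_prev, p1, p0 = 12, 5, 0.8, 0.3
probs = [aggregate_transition_prob(r, r_prev, N, p1, p0) for r in range(N + 1)]
mean = sum(r * p for r, p in enumerate(probs))
var = sum((r - mean) ** 2 * p for r, p in enumerate(probs))
```

The probabilities sum to one, and `mean` and `var` match $P_t^1 r_{t-1} + P_t^0(N - r_{t-1})$ and $P_t^1(1 - P_t^1) r_{t-1} + P_t^0(1 - P_t^0)(N - r_{t-1})$. The nested sum inside a product over $t$ is what makes direct maximization cumbersome.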


11.2 Duration Models 
11.2.1 Stationary Models — Basic Theory 


We shall first explain a continuous-time Markov model as the limit of a 
discrete-time Markov model where the time distance between two adjacent 
time periods approaches 0. Paralleling (11.1.2), we define 


$$\lambda_{jk}^i(t)\, \Delta t = \text{Prob}[i\text{th person is in state } k \text{ at time } t + \Delta t \tag{11.2.1}$$
$$\text{given that he or she was in state } j \text{ at time } t].$$

In Sections 11.2.1 through 11.2.4 we shall deal only with stationary (possibly
heterogeneous) models so that we have $\lambda_{jk}^i(t) = \lambda_{jk}^i$ for all $t$.

Let us consider a particular individual and omit the subscript $i$ to simplify
the notation. Suppose this person stayed in state $j$ in period $(0, t)$ and then
moved to state $k$ in period $(t, t + \Delta t)$. Then, assuming $t/\Delta t$ is an integer, the
probability of this event (called $A$) is

$$P(A) = (1 - \lambda_j \Delta t)^{t/\Delta t} \lambda_{jk} \Delta t, \tag{11.2.2}$$

where $\lambda_j = (\sum_{k=1}^{M} \lambda_{jk}) - \lambda_{jj}$ determines the probability of exiting $j$. But using the
well-known identity $\lim_{n \to \infty} (1 - n^{-1})^n = e^{-1}$, we obtain for small $\Delta t$

$$P(A) = \exp (-\lambda_j t)\, \lambda_{jk} \Delta t. \tag{11.2.3}$$

Because $\Delta t$ does not depend on unknown parameters, we can drop it and
regard $\exp (-\lambda_j t) \lambda_{jk}$ as the contribution of this event to the likelihood function.
The complete likelihood function of the model is obtained by first taking
the product of these terms over all the recorded events of an individual and
then over all the individuals in the sample. Because of (11.2.3), a stationary
model is also referred to as an exponential model.
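
The passage from (11.2.2) to (11.2.3) can be seen numerically by shrinking the time step. In this small sketch the exit rate, the elapsed time, and the step size are arbitrary illustrative values.

```python
import math

def discrete_survival(lam, t, dt):
    """(1 - lam * dt)^(t / dt): probability of no exit in t / dt short periods."""
    return (1.0 - lam * dt) ** round(t / dt)

lam, t = 0.5, 2.0                         # hypothetical exit rate and elapsed time
coarse = discrete_survival(lam, t, 0.1)
fine = discrete_survival(lam, t, 1e-5)
continuous = math.exp(-lam * t)           # survival term in (11.2.3)
```

The value computed with the fine grid is much closer to $\exp(-\lambda_j t)$ than the one computed with the coarse grid, which is the sense in which the continuous-time model is the limit of the discrete-time one.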

Suppose $M = 3$ and a particular individual's event history is as follows: This
person stays in state 1 in period $(0, t_1)$, moves to state 2 at time $t_1$ and stays
there until time $t_1 + t_2$, moves to state 3 at time $t_1 + t_2$ and stays there until
$t_1 + t_2 + t_3$, at which point this person is observed to move back to state 1.
(The observation is terminated after we have seen him or her move to state 1.)
Then this person's likelihood function is given by

$$L = \exp (-\lambda_1 t_1) \lambda_{12} \exp (-\lambda_2 t_2) \lambda_{23} \exp (-\lambda_3 t_3) \lambda_{31}. \tag{11.2.4}$$

If we change this scenario slightly and assume that we observe this person to
leave state 3 at time $t_1 + t_2 + t_3$ but do not know where he or she went, we
should change $\lambda_{31}$ to $\lambda_3$ in (11.2.4). Furthermore, if we terminate our observation
at time $t_1 + t_2 + t_3$ without knowing whether he or she continues to stay in
state 3 or not, we should drop $\lambda_{31}$ altogether from (11.2.4). In this last case we
say "censoring (more exactly, right-censoring) occurs at time $t_1 + t_2 + t_3$."
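
The three variants of (11.2.4) differ only in the final factor, which a small function can make explicit. The data layout (a list of `(state, duration, destination)` tuples and a dictionary of rates) is a hypothetical convention adopted for this sketch, and the rates are invented.

```python
import math

def event_history_loglik(spells, rates):
    """Log likelihood of one event history under constant transition rates.

    destination is a state number, "unknown" (exit seen, destination not),
    or None (right-censored spell).
    """
    loglik = 0.0
    for state, duration, dest in spells:
        exit_rate = sum(r for k, r in rates[state].items() if k != state)
        loglik -= exit_rate * duration                 # survival term exp(-lambda_j t)
        if dest == "unknown":
            loglik += math.log(exit_rate)              # lambda_3 replaces lambda_31
        elif dest is not None:
            loglik += math.log(rates[state][dest])     # lambda_jk for the observed move
    return loglik

# Hypothetical rates lambda_jk for M = 3.
rates = {1: {2: 0.2, 3: 0.1}, 2: {1: 0.3, 3: 0.4}, 3: {1: 0.5, 2: 0.25}}
head = [(1, 2.0, 2), (2, 1.0, 3)]
ll_full = event_history_loglik(head + [(3, 4.0, 1)], rates)
ll_unknown = event_history_loglik(head + [(3, 4.0, "unknown")], rates)
ll_censored = event_history_loglik(head + [(3, 4.0, None)], rates)
```

Relative to the censored history, the fully observed history adds $\log \lambda_{31}$ and the history with an unknown destination adds $\log \lambda_3$, mirroring the substitutions described in the text.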

Let us consider the simple case of $M = 2$. In this case we have $\lambda_1 = \lambda_{12}$ and
$\lambda_2 = \lambda_{21}$. (We are still considering a particular individual and therefore have
suppressed the subscript $i$.) To have a concrete idea, let us suppose that state 1
signifies unemployment and state 2, employment. The event history of an
individual may consist of unemployment spells and employment spells. (If
the observation is censored from the right, the last spell is incomplete.) The
individual's likelihood function can be written as the product of two terms:
the probability of unemployment spells and the probability of employment
spells. We shall now concentrate on unemployment spells. Suppose our typical
individual experienced $r$ completed unemployment spells of duration
$t_1, t_2, \ldots, t_r$ during the observation period. Then the contribution of these $r$
spells to the likelihood function is given by

$$L = \lambda^r e^{-\lambda T}, \tag{11.2.5}$$

where we have defined $T = \sum_{j=1}^{r} t_j$ and have written $\lambda$ for $\lambda_1$. The individual's
complete likelihood function is (11.2.5) times the corresponding part for the
employment spells.

We now wish to consider closely the likelihood function of one complete
spell: $e^{-\lambda t}\lambda$. At the beginning of this section, we derived it by a limit operation,
but, here, we shall give it a somewhat different (although essentially the same)
interpretation. We can interpret $e^{-\lambda t}$ as $P(T > t)$ where $T$ is a random variable
that signifies the duration of an unemployment spell. Therefore the distribution
function $F(\cdot)$ of $T$ is given by

$$F(t) = 1 - P(T > t) = 1 - e^{-\lambda t}. \tag{11.2.6}$$

Differentiating (11.2.6) we obtain the density function $f(\cdot)$ of $T$:

$$f(t) = \lambda e^{-\lambda t}. \tag{11.2.7}$$

Thus we have interpreted $e^{-\lambda t}\lambda$ as the density of the observed duration of an
unemployment spell.

From (11.2.6) and (11.2.7) we have

$$\lambda = \frac{f(t)}{1 - F(t)}. \tag{11.2.8}$$

The meaning of $\lambda$ becomes clear when we note

$$\lambda\, \Delta t = \frac{f(t)\, \Delta t}{1 - F(t)} \tag{11.2.9}$$
$$= \text{Prob}[\text{leaves unemployment in } (t, t + \Delta t) \mid \text{has not left unemployment in } (0, t)].$$

We call $\lambda$ the hazard rate. The term originates in survival analysis, where the
state in question is "life" rather than unemployment.

Still concentrating on the unemployment duration, let us suppose that we
observe one completed unemployment spell of duration $t_i$ for the $i$th individual,
$i = 1, 2, \ldots, N$. Then, defining $f^i(t_i) = \lambda^i \exp (-\lambda^i t_i)$, we can write the
likelihood function as

$$L = \prod_{i=1}^{N} f^i(t_i), \tag{11.2.10}$$

which is a standard likelihood function of a model involving continuous
random variables. Suppose, however, that individuals $i = 1, 2, \ldots, n$ complete
their unemployment spells of duration $t_i$ but individuals $i = n + 1,
\ldots, N$ are right-censored at time $t_i^*$. Then the likelihood function is
given by

$$L = \prod_{i=1}^{n} f^i(t_i) \prod_{i=n+1}^{N} [1 - F^i(t_i^*)], \tag{11.2.11}$$

which is a mixture of densities and probabilities just like the likelihood function
of a standard Tobit model. Thus we see that a duration model with
right-censoring is similar to a standard Tobit model.
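
For the special case in which every individual shares a single hazard rate, (11.2.11) can be maximized in closed form: the log likelihood is $n \log \lambda - \lambda(\sum_{i \le n} t_i + \sum_{i > n} t_i^*)$, so the maximizer is the number of completed spells divided by the total observed time. The sketch below assumes this homogeneous case, which is more restrictive than the text, and uses made-up durations.

```python
import math

def censored_loglik(lam, completed, censored):
    """Log of (11.2.11) when all individuals share one hazard rate lam."""
    return len(completed) * math.log(lam) - lam * (sum(completed) + sum(censored))

def exponential_mle(completed, censored):
    """Completed spells divided by total time at risk."""
    return len(completed) / (sum(completed) + sum(censored))

completed = [2.0, 4.0, 6.0]      # hypothetical completed spell lengths
censored = [8.0]                 # hypothetical censoring time of an unfinished spell
lam_hat = exponential_mle(completed, censored)
```

Ignoring the censored spell, or treating it as completed, would both change the estimate; here the censored observation contributes time at risk to the denominator but no exit to the numerator.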




11.2.2 Number of Completed Spells 


The likelihood function (11.2.5) depends on the observed durations
$t_1, t_2, \ldots, t_r$ only through $r$ and $T$. In other words, $r$ and $T$ constitute the
sufficient statistics. This is a property of a stationary model. We shall show an
alternative way of deriving the equivalent likelihood function.

We shall first derive the probability of observing two completed spells in
total unemployment time $T$, denoted $P(2, T)$. The assumption that there are
two completed unemployment spells implies that the third spell is incomplete
(its duration may be exactly 0). Denoting the duration of the three spells by $t_1$,
$t_2$, and $t_3$, we have

$$P(2, T) = P(0 \leq t_1 < T,\ 0 < t_2 \leq T - t_1,\ t_3 \geq T - t_1 - t_2) \tag{11.2.12}$$
$$= \int_0^T f(z_1) \left\{ \int_0^{T - z_1} f(z_2) \left[ \int_{T - z_1 - z_2}^{\infty} f(z_3)\, dz_3 \right] dz_2 \right\} dz_1$$
$$= \int_0^T \lambda \exp (-\lambda z_1) \left( \int_0^{T - z_1} \lambda \exp (-\lambda z_2) \{\exp [-\lambda (T - z_1 - z_2)]\}\, dz_2 \right) dz_1$$
$$= \frac{(\lambda T)^2 e^{-\lambda T}}{2}.$$

It is easy to deduce from the derivation in (11.2.12) that the probability of
observing $r$ completed spells in total time $T$ is given by

$$P(r, T) = \frac{(\lambda T)^r e^{-\lambda T}}{r!}, \tag{11.2.13}$$

which is a Poisson distribution. This is equivalent to (11.2.5) because $T^r$ and $r!$
do not depend on the unknown parameters.
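
The closed form for $P(2, T)$ can be compared with a direct numerical evaluation of the double integral in (11.2.12), after the innermost integral is replaced by the survivor function. The midpoint rule, the grid size, and the parameter values below are choices made only for this sketch.

```python
import math

def prob_two_spells(lam, T, n=200):
    """Midpoint-rule evaluation of the double integral in (11.2.12)."""
    total = 0.0
    h1 = T / n
    for a in range(n):
        z1 = (a + 0.5) * h1                     # length of the first completed spell
        h2 = (T - z1) / n
        for b in range(n):
            z2 = (b + 0.5) * h2                 # length of the second completed spell
            density = lam * math.exp(-lam * z1) * lam * math.exp(-lam * z2)
            survivor = math.exp(-lam * (T - z1 - z2))   # third spell still in progress
            total += density * survivor * h1 * h2
    return total

lam, T = 0.7, 3.0                               # hypothetical hazard and total time
numeric = prob_two_spells(lam, T)
poisson = (lam * T) ** 2 * math.exp(-lam * T) / 2.0     # (11.2.13) with r = 2
```

The two numbers agree closely because the integrand reduces to $\lambda^2 e^{-\lambda T}$, which does not depend on how the time $T$ is divided among the spells; this is the same fact that makes $r$ and $T$ sufficient.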

We can now put back the subscript $i$ in the right-hand side of (11.2.5) and
take the product over $i$ to obtain the likelihood function of all the individuals:

$$L = \prod_{i=1}^{N} \lambda_i^{r_i} \exp (-\lambda_i T_i). \tag{11.2.14}$$

Assuming that $\lambda_i$ depends on a vector of the $i$th individual's characteristics $x_i$,
we can specify

$$\lambda_i = \exp (\alpha + \beta' x_i), \tag{11.2.15}$$

where the scalar $\alpha$ and the vector $\beta$ are unknown parameters. We shall derive
the ML estimators of these parameters.⁴ In the following discussion it will
become apparent why we have separated the constant term $\alpha$ from the other
regression coefficients.

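The log likelihood function is given by

$$\log L = \sum_i r_i (\alpha + \beta' x_i) - \sum_i T_i \exp (\alpha + \beta' x_i). \tag{11.2.16}$$

The ML estimators $\hat{\alpha}$ and $\hat{\beta}$ are the solutions of the normal equations:

$$\frac{\partial \log L}{\partial \alpha} = \sum_i r_i - e^{\alpha} \sum_i T_i \exp (\beta' x_i) = 0 \tag{11.2.17}$$

and

$$\frac{\partial \log L}{\partial \beta} = \sum_i r_i x_i - e^{\alpha} \sum_i T_i \exp (\beta' x_i)\, x_i = 0. \tag{11.2.18}$$

Solving (11.2.17) for $e^{\alpha}$ and inserting it into (11.2.18) yield

$$\frac{\sum_i r_i x_i}{\sum_i r_i} = \frac{\sum_i T_i \exp (\beta' x_i)\, x_i}{\sum_i T_i \exp (\beta' x_i)}. \tag{11.2.19}$$

Thus $\hat{\beta}$ can be obtained from (11.2.19) by the method given in the following
paragraph. Inserting $\hat{\beta}$ into (11.2.17) yields an explicit solution for $\hat{\alpha}$.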
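The following method of obtaining $\hat{\beta}$ was proposed by Holford (1980).
Define

$$L_1 = \prod_i \left[\frac{T_i \exp (\beta' x_i)}{\sum_i T_i \exp (\beta' x_i)}\right]^{r_i}. \tag{11.2.20}$$

Therefore, we have

$$\log L_1 = \sum_i r_i \log T_i + \sum_i r_i \beta' x_i - \left(\sum_i r_i\right) \log \sum_i T_i \exp (\beta' x_i). \tag{11.2.21}$$

Setting the derivative of (11.2.21) with respect to $\beta$ equal to 0 yields

$$\frac{\partial \log L_1}{\partial \beta} = \sum_i r_i x_i - \left(\sum_i r_i\right) \frac{\sum_i T_i \exp (\beta' x_i)\, x_i}{\sum_i T_i \exp (\beta' x_i)} = 0, \tag{11.2.22}$$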
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which is identical to (11.2.19). Note that L, is the likelihood function of a 
multinomial logit model (9.3.34). To see this, pretend that the i in (11.2.20) 
refers to the ith alternative and r; people chose the ith alternative. This is a 
model where the exogenous variables depend only on the characteristics of the 
alternatives and not on those of the individuals. Thus the maximization of L, 
and hence the solution of (1 1.2.19) can be accomplished by a standard multi- 
nomial logit routine. 
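We can write $L_1$ as a part of the likelihood function $L$ as follows:

$$
L = L_1 \cdot L_2 \cdot \frac{\left( \sum_i r_i \right)!}{\prod_i T_i^{r_i}}, \tag{11.2.23}
$$

where

$$
L_2 = \frac{\exp\left[ -\sum_i T_i \exp(\alpha + \beta' x_i) \right] \left[ \sum_i T_i \exp(\alpha + \beta' x_i) \right]^{\sum_i r_i}}{\left( \sum_i r_i \right)!}. \tag{11.2.24}
$$

Note that $L_2$ is a Poisson distribution. Setting $\partial \log L_2 / \partial \alpha = 0$ yields (11.2.17). We can describe the calculation of the MLE as follows: First, maximize $L_1$ with respect to $\beta$; second, insert $\hat{\beta}$ into $L_2$ and maximize it with respect to $\alpha$.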
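The two-step calculation can be sketched as follows for a scalar covariate. This is only an illustration under stated assumptions: Newton's method is used here in place of a multinomial logit routine, and the function name, starting value, and tolerance are hypothetical choices.

```python
import math

def holford_two_step(r, T, x, tol=1e-12, max_iter=100):
    """Two-step MLE of (alpha, beta) when lambda_i = exp(alpha + beta * x_i), scalar x_i.

    Step 1: Newton's method on d log L_1 / d beta = 0, which is (11.2.22).
    Step 2: closed-form alpha obtained from (11.2.17).
    """
    R = sum(r)
    rx = sum(ri * xi for ri, xi in zip(r, x))
    beta = 0.0
    for _ in range(max_iter):
        w = [Ti * math.exp(beta * xi) for Ti, xi in zip(T, x)]
        s0 = sum(w)
        s1 = sum(wi * xi for wi, xi in zip(w, x))
        s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
        score = rx - R * s1 / s0
        hessian = -R * (s2 / s0 - (s1 / s0) ** 2)
        step = score / hessian
        beta -= step
        if abs(step) < tol:
            break
    s0 = sum(Ti * math.exp(beta * xi) for Ti, xi in zip(T, x))
    alpha = math.log(R / s0)  # solves (11.2.17) given beta
    return alpha, beta
```

For example, `holford_two_step([2, 0, 3, 1, 4], [10, 8, 9, 7, 6], [0.1, -0.5, 0.8, 0.0, 1.2])` returns a pair at which both normal equations (11.2.17) and (11.2.18) hold to numerical precision.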


11.2.3 Durations as Dependent Variables of a Regression Equation


Suppose that each individual experiences one complete spell. Then the likelihood function is

$$
L = \prod_{i=1}^{N} \lambda_i \exp(-\lambda_i t_i). \tag{11.2.25}
$$

The case of a person having more than one complete spell can be handled by behaving as if these spells belonged to different individuals. Assume as before

$$
\lambda_i = \exp(\beta' x_i). \tag{11.2.26}
$$

But, here, we have absorbed the constant term $\alpha$ into $\beta$ as there is no need to separate it out.
We shall derive the asymptotic covariance matrix of the MLE $\hat{\beta}$. We have

$$
\frac{\partial^2 \log L}{\partial \beta\, \partial \beta'} = -\sum_i t_i \exp(\beta' x_i)\, x_i x_i'. \tag{11.2.27}
$$

But we have

$$
E t_i = \int_0^{\infty} z \lambda_i \exp(-\lambda_i z)\, dz = \frac{1}{\lambda_i}, \tag{11.2.28}
$$

which is the continuous-time analog of (11.1.65). Therefore we have

$$
E \frac{\partial^2 \log L}{\partial \beta\, \partial \beta'} = -\sum_i x_i x_i'. \tag{11.2.29}
$$

Therefore, by Theorem 4.2.4, $\hat{\beta}$ is asymptotically normal with asymptotic covariance matrix

$$
V \hat{\beta} = \left( \sum_i x_i x_i' \right)^{-1}. \tag{11.2.30}
$$


Now, suppose we use $\log t_i$ as the dependent variable of a linear regression equation. For this purpose we need the mean and variance of $\log t_i$. We have⁵

$$
E \log t_i = \lambda_i \int_0^{\infty} (\log z) \exp(-\lambda_i z)\, dz = -c - \log \lambda_i, \tag{11.2.31}
$$

where $c \cong 0.577$ is Euler's constant, and

$$
E (\log t_i)^2 = \lambda_i \int_0^{\infty} (\log z)^2 \exp(-\lambda_i z)\, dz = \frac{\pi^2}{6} + (c + \log \lambda_i)^2. \tag{11.2.32}
$$

Therefore we have

$$
V \log t_i = \frac{\pi^2}{6}. \tag{11.2.33}
$$

Using $\log \lambda_i = \beta' x_i$ from (11.2.26), we can write (11.2.31) and (11.2.33) as a linear regression

$$
\log t_i + c = -\beta' x_i + u_i, \tag{11.2.34}
$$

where $E u_i = 0$ and $V u_i = \pi^2/6$. Because $\{u_i\}$ are independent, (11.2.34) defines a classical regression model, which we called Model 1 in Chapter 1. Therefore the exact covariance matrix of the LS estimator $\hat{\beta}_{LS}$ is given by

$$
V \hat{\beta}_{LS} = \frac{\pi^2}{6} \left( \sum_i x_i x_i' \right)^{-1}. \tag{11.2.35}
$$

Comparing (11.2.35) with (11.2.30), we see exactly how much efficiency we lose by this method: the covariance matrix of the LS estimator is $\pi^2/6 \cong 1.645$ times that of the MLE.
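The moments in (11.2.31) and (11.2.33) can be checked numerically. The sketch below is an illustration only: it integrates over $u = \log t$, whose density is $\lambda e^{u} \exp(-\lambda e^{u})$, with a midpoint rule; the integration limits and grid size are assumptions chosen for convenience.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant c

def log_duration_moments(lam, lo=-40.0, hi=8.0, n=200_000):
    """Mean and variance of log t for t ~ exponential(lam), by a midpoint rule in u = log t."""
    h = (hi - lo) / n
    m1 = 0.0
    m2 = 0.0
    for k in range(n):
        u = lo + (k + 0.5) * h
        dens = lam * math.exp(u - lam * math.exp(u))  # density of u = log t
        m1 += u * dens * h
        m2 += u * u * dens * h
    return m1, m2 - m1 ** 2
```

With `lam = 2.0` the computed mean should be close to $-c - \log 2$ and the variance close to $\pi^2/6$, whatever the value of $\lambda$.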

Alternatively, we can define a nonlinear regression model using $t_i$ itself as the dependent variable. From (11.2.28) we have

$$
t_i = \exp(-\beta' x_i) + v_i, \tag{11.2.36}
$$

where $E v_i = 0$. We have by integration by parts

$$
E t_i^2 = \int_0^{\infty} z^2 \lambda_i \exp(-\lambda_i z)\, dz = \frac{2}{\lambda_i^2}. \tag{11.2.37}
$$

Therefore we have

$$
V v_i = \exp(-2 \beta' x_i). \tag{11.2.38}
$$

This shows that (11.2.36) defines a heteroscedastic nonlinear regression model. The asymptotic normality of the NLWLS estimator $\hat{\beta}_{WLS}$ can be proved under general assumptions using the results of Sections 4.3.3 and 6.5.3. Its asymptotic covariance matrix can be deduced from (4.3.21) and (6.1.4) and is given by

$$
V \hat{\beta}_{WLS} = \left[ \sum_i \exp(2 \beta' x_i)\, \frac{\partial \exp(-\beta' x_i)}{\partial \beta} \frac{\partial \exp(-\beta' x_i)}{\partial \beta'} \right]^{-1} = \left( \sum_i x_i x_i' \right)^{-1}. \tag{11.2.39}
$$

Therefore the NLWLS estimator is asymptotically efficient. In practice $V v_i$ will be estimated by replacing $\beta$ by some consistent estimator of $\beta$, such as $\hat{\beta}_{LS}$ defined earlier, which converges to $\beta$ at the speed of $\sqrt{n}$. The NLWLS estimator using the estimated $V v_i$ has the same asymptotic distribution.

The foregoing analysis was based on the assumption that $t_i$ is the duration of
a completed spell of the ith individual. If some of the spells are right-censored 
and not completed, the regression method cannot easily handle them. How- 
ever, maximum likelihood estimation can take account of right-censoring, as 
we indicated in Section 11.2.1. 


11.2.4 Discrete Observations 


In the analysis presented in the preceding three subsections, we assumed that 
an individual is continuously observed and his or her complete event history 
during the sample period is provided. However, in many practical situations a 
researcher may be able to observe the state of an individual only at discrete 
times. If the observations occur at irregular time intervals, it is probably more reasonable to assume a continuous-time Markov model rather than a discrete-time model.

We shall derive the likelihood function based on discrete observations in the case of $M = 2$. The underlying model is the same stationary (or exponential) model we have so far considered. We define

$$
P^i_{jk}(t) = \text{Prob}[\text{$i$th person is in state $k$ at time $t$ given that he or she was in state $j$ at time 0}]. \tag{11.2.40}
$$

Note that this definition differs slightly from the definition (11.1.2) of the same symbol used in the Markov chain model. We shall use the definition (11.2.1) with the stationarity assumption $\lambda^i_{jk}(t) = \lambda^i_{jk}$ for all $t$.

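As before, we shall concentrate on a particular individual and suppress the superscript $i$. When $\Delta t$ is sufficiently small, we have approximately

$$
P_{12}(t + \Delta t) = P_{11}(t) \lambda_{12} \Delta t + P_{12}(t)(1 - \lambda_{21} \Delta t). \tag{11.2.41}
$$

Dividing (11.2.41) by $\Delta t$ and letting $\Delta t$ go to 0 yield

$$
\frac{dP_{12}}{dt} = P_{11} \lambda_{12} - P_{12} \lambda_{21}. \tag{11.2.42}
$$

Performing an analogous operation on $P_{11}$, $P_{21}$, and $P_{22}$, we obtain the linear vector differential equation

$$
\begin{bmatrix} \dfrac{dP_{11}}{dt} & \dfrac{dP_{21}}{dt} \\[2ex] \dfrac{dP_{12}}{dt} & \dfrac{dP_{22}}{dt} \end{bmatrix}
=
\begin{bmatrix} -\lambda_{12} & \lambda_{21} \\ \lambda_{12} & -\lambda_{21} \end{bmatrix}
\begin{bmatrix} P_{11} & P_{21} \\ P_{12} & P_{22} \end{bmatrix}
\tag{11.2.43}
$$

or

$$
\frac{dP'}{dt} = A P'.
$$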

A solution of (11.2.43) can be shown to be⁶

$$
P' = H e^{Dt} H^{-1}, \tag{11.2.44}
$$

where the columns of $H$ are characteristic vectors of $A$, $D$ is the diagonal matrix consisting of the characteristic roots of $A$, and $e^{Dt}$ is the diagonal matrix consisting of $\exp(d_j t)$, $d_j$ being the elements of $D$.

We shall derive $D$ and $H$. Solving the determinantal equation

$$
|A - dI| = 0 \tag{11.2.45}
$$

yields the two characteristic roots $d_1 = 0$ and $d_2 = -(\lambda_{12} + \lambda_{21})$. Therefore we have

$$
D = \begin{bmatrix} 0 & 0 \\ 0 & -(\lambda_{12} + \lambda_{21}) \end{bmatrix}. \tag{11.2.46}
$$

Let $h_1$ be the first column of $H$ (the characteristic vector corresponding to the zero root). Then it should satisfy $A h_1 = 0$, which yields a solution (which is not unique) $h_1 = (\lambda_{21}, \lambda_{12})'$. Next, the second column $h_2$ of $H$ should satisfy $[A + (\lambda_{12} + \lambda_{21}) I] h_2 = 0$, which yields a solution $h_2 = (-1, 1)'$. Combining the two vectors, we obtain

$$
H = \begin{bmatrix} \lambda_{21} & -1 \\ \lambda_{12} & 1 \end{bmatrix}. \tag{11.2.47}
$$


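Finally, inserting (11.2.46) and (11.2.47) into (11.2.44) yields the following expressions for the elements of $P'$ (putting back the superscript $i$):

$$
\begin{aligned}
P^i_{11}(t) &= 1 - \gamma_i + \gamma_i \exp(-\delta_i t) \\
P^i_{12}(t) &= \gamma_i - \gamma_i \exp(-\delta_i t) \\
P^i_{21}(t) &= 1 - \gamma_i - (1 - \gamma_i) \exp(-\delta_i t) \\
P^i_{22}(t) &= \gamma_i + (1 - \gamma_i) \exp(-\delta_i t),
\end{aligned}
\tag{11.2.48}
$$

where $\gamma_i = \lambda^i_{12} / (\lambda^i_{12} + \lambda^i_{21})$ and $\delta_i = \lambda^i_{12} + \lambda^i_{21}$. Suppose we observe the $i$th individual in state $j_i$ at time 0 and in state $k_i$ at time $t_i$, $i = 1, 2, \ldots, N$. Then the likelihood function is given by

$$
L = \prod_{i=1}^{N} P^i_{j_i k_i}(t_i). \tag{11.2.49}
$$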
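A minimal sketch of (11.2.48) and (11.2.49) follows, assuming for simplicity that the two transition rates are common to all individuals. The function names and the coding of observations as `(j, k, t)` triples are hypothetical conventions introduced for this illustration.

```python
import math

def transition_matrix(lam12, lam21, t):
    """P(t) = [[P11, P12], [P21, P22]] for the two-state stationary model, from (11.2.48)."""
    delta = lam12 + lam21
    gamma = lam12 / delta
    e = math.exp(-delta * t)
    return [[1 - gamma + gamma * e, gamma - gamma * e],
            [1 - gamma - (1 - gamma) * e, gamma + (1 - gamma) * e]]

def matmul2(a, b):
    """Product of two 2 x 2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def discrete_log_likelihood(lam12, lam21, observations):
    """Log of (11.2.49); each observation is (j, k, t) with states coded 1 or 2."""
    return sum(math.log(transition_matrix(lam12, lam21, t)[j - 1][k - 1])
               for j, k, t in observations)
```

One way to gain confidence in the closed form is to confirm that each row of $P(t)$ sums to one and that $P(s + t) = P(s) P(t)$, as any stationary Markov transition function must satisfy.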


11.2.5 Nonstationary Models 


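So far we have assumed $\lambda^i_{jk}(t) = \lambda^i_{jk}$ for all $t$ (constant hazard rate). Now we shall remove this assumption. Such models are called nonstationary or semi-Markov.

Suppose a typical individual stayed in state $j$ in period $(0, t)$ and then moved to state $k$ in period $(t, t + \Delta t)$. We call this event $A$ and derive its probability $P(A)$, generalizing (11.2.2) and (11.2.3). Defining $m = t/\Delta t$ and using $\log(1 - \epsilon) \cong -\epsilon$ for small $\epsilon$, we obtain for sufficiently large $m$

$$
\begin{aligned}
P(A) &= \prod_{l=1}^{m} \left[ 1 - \lambda_j\!\left( \frac{lt}{m} \right) \frac{t}{m} \right] \lambda_{jk}(t)\, \Delta t \\
&= \exp\left\{ \sum_{l=1}^{m} \log\left[ 1 - \lambda_j\!\left( \frac{lt}{m} \right) \frac{t}{m} \right] \right\} \lambda_{jk}(t)\, \Delta t \\
&\cong \exp\left[ -\sum_{l=1}^{m} \lambda_j\!\left( \frac{lt}{m} \right) \frac{t}{m} \right] \lambda_{jk}(t)\, \Delta t \\
&\cong \left\{ \exp\left[ -\int_0^t \lambda_j(z)\, dz \right] \right\} \lambda_{jk}(t)\, \frac{t}{m}.
\end{aligned}
\tag{11.2.50}
$$

The likelihood function of this individual is the last expression in (11.2.50) except for $t/m$.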

Let us obtain the likelihood function of the same event history we considered in the discussion preceding (11.2.4). The likelihood function now becomes

$$
\begin{aligned}
L ={} & \exp\left[ -\int_0^{t_1} \lambda_1(z)\, dz \right] \lambda_{12}(t_1) \\
& \times \exp\left[ -\int_{t_1}^{t_1 + t_2} \lambda_2(z)\, dz \right] \lambda_{23}(t_1 + t_2) \\
& \times \exp\left[ -\int_{t_1 + t_2}^{t_1 + t_2 + t_3} \lambda_3(z)\, dz \right] \lambda_{31}(t_1 + t_2 + t_3).
\end{aligned}
\tag{11.2.51}
$$

As in (11.2.4), $\lambda_{31}$ should be changed to $\lambda_3$ if the individual is only observed to leave state 3, and $\lambda_{31}$ should be dropped if right-censoring occurs at that time.

We shall concentrate on the transition from state 1 to state 2 and write $\lambda_{12}(t)$ simply as $\lambda(t)$, as we did earlier. The distribution function of duration under a nonstationary model is given by

$$
F(t) = 1 - \exp\left[ -\int_0^t \lambda(z)\, dz \right], \tag{11.2.52}
$$

which is reduced to (11.2.6) if $\lambda(t) = \lambda$. The density function is given by

$$
f(t) = \lambda(t) \exp\left[ -\int_0^t \lambda(z)\, dz \right]. \tag{11.2.53}
$$

The likelihood function again can be written in the form of (11.2.10) or (11.2.11), depending on whether right-censoring is absent or present.

Thus we see that there is no problem in writing down the likelihood function. The problem, of course, is how to estimate the parameters. Suppose we specify $\lambda^i(t)$ generally as

$$
\lambda^i(t) = g(x_{it}, \beta). \tag{11.2.54}
$$

To evaluate $\int_0^t \lambda^i(z)\, dz$, which appears in the likelihood function, a researcher must specify $x_{it}$ precisely as a continuous function of $t$, which is not an easy thing to do in practice. We shall discuss several empirical articles and shall see how these articles deal with the problem of nonstationarity, as well as with other problems that may arise in empirical work.


Model of Tuma. Tuma (1976) analyzed the duration of employment at a particular job. The $i$th individual's hazard rate at time $t$ is specified as

$$
\lambda^i(t) = \beta' x_i + \alpha_1 t + \alpha_2 t^2, \tag{11.2.55}
$$

where $x_i$ is a vector of socioeconomic characteristics of the $i$th individual.⁷ The parameter values are assumed to be such that the hazard rate is nonnegative over a relevant range of $t$. Because of the simple form of the hazard rate, it can easily be integrated to yield

$$
\int_0^t \lambda^i(z)\, dz = \beta' x_i t + \frac{\alpha_1}{2} t^2 + \frac{\alpha_2}{3} t^3. \tag{11.2.56}
$$

Some people terminate their employment during the sample period, but some remain in their jobs at the end of the sample period (right-censoring). Therefore Tuma's likelihood function is precisely in the form of (11.2.11).
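To illustrate how (11.2.55) and (11.2.56) enter a likelihood with right-censoring, here is a minimal sketch. It assumes the usual form in which a completed spell contributes the density (11.2.53) and a censored spell contributes only the survivor probability; the argument `bx` (the scalar index $\beta' x_i$) and the function names are hypothetical.

```python
import math

def tuma_hazard(t, bx, a1, a2):
    """Hazard rate (11.2.55); bx is the scalar index beta'x_i."""
    return bx + a1 * t + a2 * t * t

def tuma_cum_hazard(t, bx, a1, a2):
    """Integrated hazard (11.2.56)."""
    return bx * t + a1 * t ** 2 / 2 + a2 * t ** 3 / 3

def tuma_log_lik(durations, completed, bx_values, a1, a2):
    """Log likelihood with right-censoring (assumed form: density for completed
    spells, survivor probability for censored spells)."""
    ll = 0.0
    for t, done, bx in zip(durations, completed, bx_values):
        if done:
            ll += math.log(tuma_hazard(t, bx, a1, a2))
        ll -= tuma_cum_hazard(t, bx, a1, a2)
    return ll
```

The parameter values passed in must keep the hazard positive at every completed duration, in line with the nonnegativity assumption stated above; otherwise the logarithm is undefined.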


Model of Tuma, Hannan, and Groeneveld. Tuma, Hannan, and Groeneveld (1979) studied the duration of marriage. They handled nonstationarity by dividing the sample period into four subperiods and assuming that the hazard rate remains constant within each subperiod but varies across different subperiods. More specifically, they specified

$$
\lambda^i(t) = \beta_p' x_i \quad \text{for } t \in T_p, \quad p = 1, 2, 3, 4, \tag{11.2.57}
$$

where $T_p$ is the $p$th subperiod. This kind of a discrete change in the hazard rate creates no real problem. Suppose that the event history of an individual consists of a single completed spell of duration $t$ and that during this period a constant hazard rate $\lambda(1)$ holds from time 0 to time $\tau$ and another constant rate $\lambda(2)$ holds from time $\tau$ to time $t$. Then this individual's likelihood function is given by

$$
L = e^{-\lambda(1) \tau}\, e^{-\lambda(2)(t - \tau)}\, \lambda(2). \tag{11.2.58}
$$


Model of Lancaster. Lancaster (1979) was concerned with unemployment duration and questioned the assumption of a constant hazard rate. Although a simple search theory may indicate an increasing hazard rate, it is not clear from economic theory alone whether we should expect a constant, decreasing, or increasing hazard rate for unemployment spells. Lancaster used a Weibull distribution, which leads to a nonconstant hazard rate. The Weibull distribution is defined by

$$
F(t) = 1 - \exp(-\lambda t^{\alpha}). \tag{11.2.59}
$$

If $\alpha = 1$, (11.2.59) is reduced to the exponential distribution considered in the preceding subsections. Its density is given by

$$
f(t) = \lambda \alpha t^{\alpha - 1} \exp(-\lambda t^{\alpha}) \tag{11.2.60}
$$

and its hazard rate by

$$
\lambda(t) = \frac{f(t)}{1 - F(t)} = \lambda \alpha t^{\alpha - 1}. \tag{11.2.61}
$$

Whether $\alpha$ is greater than 1 or not determines whether the hazard rate is increasing or not, as is shown in the following correspondence:

$$
\begin{aligned}
\alpha > 1 &\Leftrightarrow \frac{\partial \lambda}{\partial t} > 0 \quad \text{(increasing hazard rate)} \\
\alpha = 1 &\Leftrightarrow \frac{\partial \lambda}{\partial t} = 0 \quad \text{(constant hazard rate)} \\
\alpha < 1 &\Leftrightarrow \frac{\partial \lambda}{\partial t} < 0 \quad \text{(decreasing hazard rate).}
\end{aligned}
\tag{11.2.62}
$$

Lancaster specified the $i$th person's hazard rate as

$$
\lambda^i(t) = \alpha t^{\alpha - 1} \exp(\beta' x_i), \tag{11.2.63}
$$

where $x_i$ is a vector of the $i$th person's characteristics. His ML estimate of $\alpha$ turned out to be 0.77, a result indicating a decreasing hazard rate. However, Lancaster reported an interesting finding: His estimate of $\alpha$ increased as he included more exogenous variables in the model. This result indicates that the decreasing hazard rate implied by his first estimate was at least partly due to the heterogeneity caused by the initially omitted exogenous variables rather than true duration dependence.
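The relations (11.2.59) through (11.2.62) can be illustrated with a few lines of code. The function names are hypothetical; the value $\alpha = 0.77$ is the estimate quoted above, and the other parameter values are arbitrary.

```python
import math

def weibull_cdf(t, lam, alpha):
    """Distribution function (11.2.59)."""
    return 1.0 - math.exp(-lam * t ** alpha)

def weibull_pdf(t, lam, alpha):
    """Density (11.2.60)."""
    return lam * alpha * t ** (alpha - 1) * math.exp(-lam * t ** alpha)

def weibull_hazard(t, lam, alpha):
    """Hazard rate (11.2.61)."""
    return lam * alpha * t ** (alpha - 1)
```

Evaluating `weibull_hazard` at two time points with $\alpha = 0.77$, $1$, and $1.5$ reproduces the three cases of (11.2.62): the hazard falls, stays constant, and rises, respectively.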

Because it may not be possible to include all the relevant exogenous variables, Lancaster considered an alternative specification for the hazard rate

$$
\mu^i(t) = v_i \lambda^i(t), \tag{11.2.64}
$$

where $\lambda^i(t)$ is as given in (11.2.63) and $v_i$ is an unobservable random variable independently and identically distributed as Gamma$(1, \sigma^2)$. The random variable $v_i$ may be regarded as a proxy for all the unobservable exogenous variables. By (11.2.52) we have (suppressing $i$)

$$
F(t \mid v) = 1 - \exp[-v \Lambda(t)], \tag{11.2.65}
$$

where $\Lambda(t) = \int_0^t \lambda(z)\, dz$. Taking the expectation of (11.2.65) yields

$$
F^*(t) = E_v F(t \mid v) = 1 - [1 + \sigma^2 \Lambda(t)]^{-\sigma^{-2}}. \tag{11.2.66}
$$

Therefore, using (11.2.8), we obtain

$$
\begin{aligned}
\lambda^*(t) &= \lambda(t) [1 + \sigma^2 \Lambda(t)]^{-1} \\
&= \lambda(t) [1 - F^*(t)]^{\sigma^2}.
\end{aligned}
\tag{11.2.67}
$$

Because $[1 - F^*(t)]^{\sigma^2}$ is a decreasing function of $t$, (11.2.67) shows that the heterogeneity adds a tendency for a decreasing hazard rate. Under this new model (with $\sigma^2$ as an additional unknown parameter), Lancaster finds the MLE of $\alpha$ to be 0.9. Thus he argues that a decreasing hazard rate in his model is caused more by heterogeneity than true duration dependence.
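The mixing result (11.2.66) can be verified numerically. The sketch below assumes that Gamma$(1, \sigma^2)$ denotes a gamma distribution with mean 1 and variance $\sigma^2$ (shape $1/\sigma^2$, scale $\sigma^2$); the function names, the truncation point, and the grid size are illustrative choices.

```python
import math

def mixed_survivor(Lam, sigma2):
    """1 - F*(t) from (11.2.66), written as a function of Lam = Lambda(t)."""
    return (1.0 + sigma2 * Lam) ** (-1.0 / sigma2)

def mixed_hazard(lam_t, Lam, sigma2):
    """lambda*(t) from (11.2.67)."""
    return lam_t / (1.0 + sigma2 * Lam)

def mixed_survivor_numeric(Lam, sigma2, upper=80.0, n=200_000):
    """E_v exp(-v * Lam) for v ~ gamma with mean 1 and variance sigma2 (midpoint rule)."""
    shape, scale = 1.0 / sigma2, sigma2
    norm = math.gamma(shape) * scale ** shape
    h = upper / n
    total = 0.0
    for k in range(n):
        v = (k + 0.5) * h
        dens = v ** (shape - 1.0) * math.exp(-v / scale) / norm
        total += math.exp(-v * Lam) * dens * h
    return total
```

Even when the individual hazard is constant, so that $\Lambda(t) = \lambda t$, `mixed_hazard` declines with $t$, which is the tendency described in the text.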


Model of Heckman and Borjas. In their article, which is also concerned with unemployment duration, Heckman and Borjas (1980) introduced another source of variability of $\lambda$ in addition to the Weibull specification and heterogeneity. In their model, $\lambda$ also varies with spells. Let $l$ denote the $l$th unemployment spell the $i$th individual experiences. Then Heckman and Borjas specify the hazard rate as

$$
\lambda^{il}(t) = \alpha t^{\alpha - 1} \exp(\beta_l' x_{il} + v_i), \tag{11.2.68}
$$

where $v_i$ is unobservable and therefore should be integrated out to obtain the marginal distribution function of duration.⁸


Model of Flinn and Heckman. In a study of unemployment duration, Flinn and Heckman (1982) generalized (11.2.68) further as

$$
\lambda^{il}(t) = \exp\left[ \beta_l' x_{il}(t) + c v_i + \gamma_1 \frac{t^{\lambda_1} - 1}{\lambda_1} + \gamma_2 \frac{t^{\lambda_2} - 1}{\lambda_2} \right]. \tag{11.2.69}
$$

The function $(t^{\lambda} - 1)/\lambda$ is the Box-Cox transformation (see Section 8.1.2) and approaches $\log t$ as $\lambda$ approaches 0. Therefore putting $\lambda_1 = 0$ and $\gamma_2 = 0$ in (11.2.69) reduces it to a Weibull model. Note that $x_{il}$ is assumed to depend on $t$ in (11.2.69). Flinn and Heckman assumed that changes in $x_{il}(t)$ occurred only at the beginning of a month and that the levels were constant throughout the month. The authors devised an efficient computation algorithm for handling the heterogeneity correlated across spells and the exogenous variables varying with time.


11.2.6 Problem of Left-Censoring 


In this subsection we shall consider the problem of left-censoring in models for 
unemployment duration. When every individual is observed at the start of his 
or her unemployment spell, there is no problem of left-censoring. Often, 
however, an individual is observed to be in an unemployment spell at the start 
of the sample period. This complicates the estimation. 

Suppose that an unemployment spell of a particular individual starts at time $-s$, the individual is interviewed at time 0, and the spell terminates at $t$. (To simplify the analysis, we assume that right-censoring at time $t$ does not occur.) The treatment of left-censoring varies according to the following three cases: (1) $s$ is observed but $t$ is not observed, (2) both $s$ and $t$ are observed, and (3) $t$ is observed but $s$ is not observed. For each case we shall derive the relevant likelihood function.

The first case corresponds to the situation analyzed by Nickell (1979), although his is a Markov chain model. Assuming that the underlying distribution of the duration is $F(\cdot)$ and its density $f(\cdot)$, we can derive the density $g(s)$. Denoting the state of "being unemployed" by $U$, we have for sufficiently small $\Delta s$

$$
\begin{aligned}
g(s)\, \Delta s &= P[U \text{ started in } (-s - \Delta s, -s) \mid U \text{ at } 0] \\
&= \frac{P[U \text{ at } 0 \mid U \text{ started in } (-s - \Delta s, -s)]\, P[U \text{ started in } (-s - \Delta s, -s)]}{\int_0^{\infty} (\text{Numerator})\, ds} \\
&= \frac{P[U \text{ at } 0 \mid U \text{ started in } (-s - \Delta s, -s)]\, \Delta s}{\int_0^{\infty} (\text{Numerator})\, ds} \\
&= \frac{[1 - F(s)]\, \Delta s}{\int_0^{\infty} [1 - F(s)]\, ds} = \frac{[1 - F(s)]\, \Delta s}{ES},
\end{aligned}
\tag{11.2.70}
$$

where $ES = \int_0^{\infty} s f(s)\, ds$. In (11.2.70) the third equality follows from the assumption that $P[U \text{ started in } (-s - \Delta s, -s)]$ does not depend on $s$ (the assumption of constant entry rate), and the last equality follows from integration by parts. By eliminating $\Delta s$ from both sides of (11.2.70), we obtain

$$
g(s) = \frac{1 - F(s)}{ES}. \tag{11.2.71}
$$


In the second case we should derive the joint density $g(s, t) = g(t \mid s)\, g(s)$. The density $g(s)$ has been derived, but we still need to derive $g(t \mid s)$. Let $X$ denote total unemployment duration. First evaluate

$$
\begin{aligned}
P(X > s + t \mid X > s) &= \frac{P(X > s + t,\, X > s)}{P(X > s)} \\
&= \frac{P(X > s + t)}{P(X > s)} \\
&= \frac{1 - F(s + t)}{1 - F(s)}.
\end{aligned}
\tag{11.2.72}
$$

If we denote the distribution function of $g(t \mid s)$ by $G(t \mid s)$, (11.2.72) implies

$$
G(t \mid s) = \frac{F(s + t) - F(s)}{1 - F(s)}. \tag{11.2.73}
$$

Therefore, differentiating (11.2.73) with respect to $t$, we obtain

$$
g(t \mid s) = \frac{f(s + t)}{1 - F(s)}. \tag{11.2.74}
$$

Finally, from (11.2.71) and (11.2.74), we obtain

$$
g(s, t) = \frac{f(s + t)}{ES}. \tag{11.2.75}
$$

This situation holds for Lancaster (1979), as he observed both s and t. 
However, Lancaster used the conditional density (11.2.74) rather than the 
joint density (11.2.75) because he felt uncertain about the assumption of 
constant entry rate. 
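The densities (11.2.71) and (11.2.75) can be checked numerically for a specific duration distribution. The sketch below uses the Weibull distribution (11.2.59) purely as an example; the function names and integration settings are assumptions made for the illustration.

```python
import math

def weibull_mean(lam, alpha):
    """ES for F(t) = 1 - exp(-lam * t**alpha)."""
    return lam ** (-1.0 / alpha) * math.gamma(1.0 + 1.0 / alpha)

def survivor(t, lam, alpha):
    """1 - F(t)."""
    return math.exp(-lam * t ** alpha)

def density(t, lam, alpha):
    """f(t)."""
    return lam * alpha * t ** (alpha - 1.0) * math.exp(-lam * t ** alpha)

def g_elapsed(s, lam, alpha):
    """g(s) of (11.2.71): density of the elapsed duration at the interview date."""
    return survivor(s, lam, alpha) / weibull_mean(lam, alpha)

def g_joint(s, t, lam, alpha):
    """g(s, t) of (11.2.75)."""
    return density(s + t, lam, alpha) / weibull_mean(lam, alpha)

def midpoint(fn, upper, n=100_000):
    """Midpoint-rule integral of fn over (0, upper)."""
    h = upper / n
    return sum(fn((k + 0.5) * h) for k in range(n)) * h
```

Integrating `g_elapsed` over $s$ should give 1, and integrating `g_joint` over $t$ for fixed $s$ should return `g_elapsed(s, ...)`. In the exponential case ($\alpha = 1$), $g(s)$ coincides with $f(s)$, reflecting the memoryless property.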

Finally, in the third case we need $g(t)$. This can be obtained by integrating $g(s, t)$ with respect to $s$ as follows:

$$
\begin{aligned}
g(t) &= \frac{1}{ES} \int_0^{\infty} f(s + t)\, ds \\
&= \frac{1 - F(t)}{ES}.
\end{aligned}
\tag{11.2.76}
$$

See Flinn and Heckman (1982) for an alternative derivation of $g(t)$.


11.2.7 Cox's Partial Maximum Likelihood Estimator


Cox (1972) considered a hazard rate of the form

$$
\lambda^i(t) = \lambda(t) \exp(\beta' x_i), \tag{11.2.77}
$$


which generalizes (11.2.63). This model is often referred to as a proportional 
hazards model. Cox proposed a partial MLE (PMLE) obtained by maximiz- 
ing a part of the likelihood function. We shall first derive the whole likelihood 
function and then write it as a product of the part Cox proposed to maximize 
and the remainder. 

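Cox allowed for right-censoring. Let $t_i$, $i = 1, 2, \ldots, n$, be completed durations and let $t_i$, $i = n + 1, n + 2, \ldots, N$, be censored durations. We assume for simplicity that $\{t_i\}$ are distinct.⁹ Then the likelihood function has the form (11.2.11): specifically,

$$
L = \prod_{i=1}^{n} \exp(\beta' x_i) \lambda(t_i) \exp[-\exp(\beta' x_i) \Lambda(t_i)] \times \prod_{i=n+1}^{N} \exp[-\exp(\beta' x_i) \Lambda(t_i)], \tag{11.2.78}
$$

where $\Lambda(t) = \int_0^t \lambda(z)\, dz$. Combining the exp functions that appear in both terms and rewriting the combined term further, we obtain

$$
\begin{aligned}
L &= \prod_{i=1}^{n} \exp(\beta' x_i) \lambda(t_i) \cdot \exp\left[ -\sum_{i=1}^{N} \exp(\beta' x_i) \int_0^{t_i} \lambda(z)\, dz \right] \\
&= \prod_{i=1}^{n} \exp(\beta' x_i) \lambda(t_i) \cdot \exp\left\{ -\int_0^{\infty} \left[ \sum_{h \in R(t)} \exp(\beta' x_h) \right] \lambda(t)\, dt \right\},
\end{aligned}
\tag{11.2.79}
$$

where $R(t) = \{ i \mid t_i \ge t \}$. To understand the second equality in (11.2.79), note that $\sum_{h \in R(t)} \exp(\beta' x_h)$ is a step function described in Figure 11.1.

Cox's PMLE $\hat{\beta}_p$ maximizes

$$
L_1 = \prod_{i=1}^{n} \frac{\exp(\beta' x_i)}{\sum_{h \in R(t_i)} \exp(\beta' x_h)}. \tag{11.2.80}
$$

It is a part of $L$ because we can write $L$ as

$$
L = L_1 L_2, \tag{11.2.81}
$$

where

$$
L_2 = \prod_{i=1}^{n} \left[ \sum_{h \in R(t_i)} \exp(\beta' x_h) \right] \lambda(t_i) \times \exp\left\{ -\int_0^{\infty} \left[ \sum_{h \in R(t)} \exp(\beta' x_h) \right] \lambda(t)\, dt \right\}. \tag{11.2.82}
$$

Figure 11.1 $\sum_{h \in R(t)} \exp(\beta' x_h)$ as a function of $t$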


Because $L_1$ does not depend on $\lambda(t)$, Cox's method enables us to estimate $\beta$ without specifying $\lambda(t)$. Cox (1975) suggested and Tsiatis (1981) proved that under general conditions the PMLE is consistent and asymptotically normal with the asymptotic covariance matrix given by

$$
V \hat{\beta}_p = -\left[ E \frac{\partial^2 \log L_1}{\partial \beta\, \partial \beta'} \right]^{-1}. \tag{11.2.83}
$$

This result is remarkable considering that $L_1$ is not even a conditional likelihood function in the usual sense.¹⁰
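A minimal sketch of the partial likelihood (11.2.80) and its maximization for a scalar covariate follows. It is an illustration under stated assumptions, not Cox's own algorithm: Newton's method is used as a generic optimizer, and the function names and data layout (`events[i]` is `True` for a completed duration) are hypothetical.

```python
import math

def cox_partial_loglik(beta, times, events, x):
    """log L_1 of (11.2.80) for a scalar covariate and distinct durations."""
    ll = 0.0
    for i, (ti, done) in enumerate(zip(times, events)):
        if not done:
            continue  # censored durations enter only through the risk sets
        risk = sum(math.exp(beta * x[h]) for h in range(len(times)) if times[h] >= ti)
        ll += beta * x[i] - math.log(risk)
    return ll

def cox_pmle(times, events, x, tol=1e-10, max_iter=50):
    """Newton's method on the score of log L_1 (scalar beta)."""
    beta = 0.0
    for _ in range(max_iter):
        score = 0.0
        hess = 0.0
        for i, (ti, done) in enumerate(zip(times, events)):
            if not done:
                continue
            w = [(math.exp(beta * x[h]), x[h]) for h in range(len(times)) if times[h] >= ti]
            s0 = sum(e for e, _ in w)
            s1 = sum(e * xh for e, xh in w)
            s2 = sum(e * xh * xh for e, xh in w)
            score += x[i] - s1 / s0
            hess -= s2 / s0 - (s1 / s0) ** 2
        step = score / hess
        beta -= step
        if abs(step) < tol:
            break
    return beta
```

Because $L_1$ depends on the durations only through the risk sets $R(t_i)$, any increasing transformation of the durations leaves the estimate unchanged, which is one way to see that $\lambda(t)$ need not be specified.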

However, $L_1$ and $L_2$ do have intuitive meanings. We shall consider their meanings in a simple example. Suppose $N = 2$ and $t_1$ is a completed duration (say, the first person dies at time $t_1$) and $t_2$ a censored duration (the second person lives at least until time $t_2$), with $t_2 > t_1$. Then we have

$$
L = \lambda^1(t_1) \exp\left[ -\int_0^{t_1} \lambda^1(z)\, dz \right] \exp\left[ -\int_0^{t_2} \lambda^2(z)\, dz \right]. \tag{11.2.84}
$$

We can write (11.2.84) as

$$
L = L_1 L_{21} L_{22} L_{23}, \tag{11.2.85}
$$

where

$$
L_1 = \frac{\lambda^1(t_1)}{\lambda^1(t_1) + \lambda^2(t_1)}, \tag{11.2.86}
$$

$$
L_{21} = \lambda^1(t_1) + \lambda^2(t_1), \tag{11.2.87}
$$

$$
L_{22} = \exp\left[ -\int_0^{t_1} \lambda^1(z)\, dz \right] \exp\left[ -\int_0^{t_1} \lambda^2(z)\, dz \right], \tag{11.2.88}
$$

and

$$
L_{23} = \exp\left[ -\int_{t_1}^{t_2} \lambda^2(z)\, dz \right]. \tag{11.2.89}
$$

These four components of $L$ can be interpreted as follows:

$L_1$ = P(#1 dies at $t_1$ | both #1 and #2 live until $t_1$ and either #1 or #2 dies at $t_1$)

$L_{21}$ = P(either #1 or #2 dies at $t_1$ | both #1 and #2 live until $t_1$)

$L_{22}$ = P(both #1 and #2 live until $t_1$)

$L_{23}$ = P(#2 lives at least until $t_2$ | #2 lives until $t_1$).

Note that $L_{21} L_{22} L_{23}$ corresponds to $L_2$ of (11.2.82).

Kalbfleisch and Prentice (1973) gave an alternative interpretation of Cox's 
partial likelihood function. First, consider the case of no censoring. Let 


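$t_1 < t_2 < \cdots < t_N$ be an ordered sequence of durations. Then by successive integrations we obtain

$$
\begin{aligned}
P(t_1 < t_2 < \cdots < t_N)
&= \int_0^{\infty} \int_{t_1}^{\infty} \cdots \int_{t_{N-1}}^{\infty} \prod_{i=1}^{N} \lambda(t_i) \exp(\beta' x_i) \\
&\qquad \times \exp\left[ -\exp(\beta' x_i) \int_0^{t_i} \lambda(z)\, dz \right] dt_N\, dt_{N-1} \cdots dt_1 \\
&= \int_0^{\infty} \int_{t_1}^{\infty} \cdots \int_{t_{N-2}}^{\infty} \prod_{i=1}^{N-2} \lambda(t_i) \exp(\beta' x_i) \\
&\qquad \times \exp\left[ -\exp(\beta' x_i) \int_0^{t_i} \lambda(z)\, dz \right] \\
&\qquad \times \lambda(t_{N-1}) \exp(\beta' x_{N-1}) \\
&\qquad \times \exp\left\{ -[\exp(\beta' x_{N-1}) + \exp(\beta' x_N)] \int_0^{t_{N-1}} \lambda(z)\, dz \right\} \\
&\qquad \times dt_{N-1}\, dt_{N-2} \cdots dt_1 \\
&= \int_0^{\infty} \int_{t_1}^{\infty} \cdots \int_{t_{N-3}}^{\infty} \prod_{i=1}^{N-2} \lambda(t_i) \exp(\beta' x_i) \\
&\qquad \times \exp\left[ -\exp(\beta' x_i) \int_0^{t_i} \lambda(z)\, dz \right] \\
&\qquad \times \frac{\exp(\beta' x_{N-1})}{\exp(\beta' x_{N-1}) + \exp(\beta' x_N)} \\
&\qquad \times \exp\left\{ -[\exp(\beta' x_{N-1}) + \exp(\beta' x_N)] \int_0^{t_{N-2}} \lambda(z)\, dz \right\} \\
&\qquad \times dt_{N-2}\, dt_{N-3} \cdots dt_1 \\
&= \frac{\exp(\beta' x_N) \exp(\beta' x_{N-1}) \cdots \exp(\beta' x_1)}{\exp(\beta' x_N) [\exp(\beta' x_N) + \exp(\beta' x_{N-1})] \cdots [\exp(\beta' x_N) + \exp(\beta' x_{N-1}) + \cdots + \exp(\beta' x_1)]}.
\end{aligned}
\tag{11.2.90}
$$

With no censoring and this ordering, $R(t_i) = \{i, i+1, \ldots, N\}$, so the last expression coincides with $L_1$ of (11.2.80).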
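The last expression in (11.2.90) is the probability of one particular rank ordering, so summing it over all $N!$ orderings must give 1. The following sketch checks this for a scalar covariate; the function name and argument layout are hypothetical.

```python
import math
from itertools import permutations

def rank_order_prob(order, beta, x):
    """Probability (11.2.90) that t[order[0]] < t[order[1]] < ... (no censoring, scalar x)."""
    w = [math.exp(beta * x[i]) for i in order]
    prob = 1.0
    for j in range(len(w)):
        # individual order[j] is the first to exit among those still at risk
        prob *= w[j] / sum(w[j:])
    return prob
```

With $\beta = 0$ every ordering has probability $1/N!$, which is the equal-probability property used later in the efficiency comparison.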


Next, suppose that completed durations are ordered as $t_1 < t_2 < \cdots < t_n$ and in the interval $[t_i, t_{i+1})$, $i = 1, 2, \ldots, n$ (with the understanding $t_{n+1} = \infty$), we observe censored durations $t_{i1}, t_{i2}, \ldots, t_{ir_i}$. Then we have

$$
\begin{aligned}
& P(t_1 < t_2 < \cdots < t_n,\; t_i \le t_{i1}, t_{i2}, \ldots, t_{ir_i} \text{ for all } i) \\
&= \int_0^{\infty} \int_{t_1}^{\infty} \cdots \int_{t_{n-1}}^{\infty} P(t_i \le t_{i1}, t_{i2}, \ldots, t_{ir_i} \text{ for all } i \mid t_1, t_2, \ldots, t_n) \\
&\qquad \times f^1(t_1) f^2(t_2) \cdots f^n(t_n)\, dt_n\, dt_{n-1} \cdots dt_1 \\
&= \int_0^{\infty} \int_{t_1}^{\infty} \cdots \int_{t_{n-1}}^{\infty} \prod_{i=1}^{n} \lambda(t_i) \exp(\beta' x_i) \\
&\qquad \times \exp\left\{ -\left[ \exp(\beta' x_i) + \sum_{j=1}^{r_i} \exp(\beta' x_{ij}) \right] \int_0^{t_i} \lambda(z)\, dz \right\} \\
&\qquad \times dt_n\, dt_{n-1} \cdots dt_1 \\
&= \frac{\exp(\beta' x_n) \exp(\beta' x_{n-1}) \cdots \exp(\beta' x_1)}{[\exp(\beta' x_n) + c_n] [\exp(\beta' x_n) + \exp(\beta' x_{n-1}) + c_n + c_{n-1}] \cdots [\exp(\beta' x_n) + \cdots + \exp(\beta' x_1) + c_n + \cdots + c_1]},
\end{aligned}
\tag{11.2.91}
$$

where $c_i = \sum_{j=1}^{r_i} \exp(\beta' x_{ij})$.

Because the parameter vector $\beta$ appears in both $L_1$ and $L_2$, we expect Cox's PMLE to be asymptotically less efficient than the full MLE in general. This is indeed the case. We shall compare the asymptotic covariance matrix of the two estimators in the special case of a stationary model where the $\lambda(t)$ that appears in the right-hand side of (11.2.77) is a constant. Furthermore, we shall suppose there is no censoring. In this special case, Cox's model is identical to the model considered in Section 11.2.3. Therefore the asymptotic covariance matrix of the MLE $\hat{\beta}$ can be derived from (11.2.30) by noting that the $\beta$ of the present section can be regarded as the vector consisting of all but the last element of the $\beta$ of Section 11.2.3. Hence, by Theorem 13 of Appendix 1, we obtain

$$
V \hat{\beta} = \left[ \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})' \right]^{-1}. \tag{11.2.92}
$$

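The asymptotic covariance matrix of Cox's PMLE can be derived from (11.2.83) as

$$
V \hat{\beta}_p = \left\{ E \sum_{i=1}^{N} \left[ \sum_h \exp(\beta' x_h) \right]^{-2} \left[ \sum_h \exp(\beta' x_h) \sum_h x_h x_h' \exp(\beta' x_h) - \sum_h x_h \exp(\beta' x_h) \sum_h x_h' \exp(\beta' x_h) \right] \right\}^{-1}, \tag{11.2.93}
$$

where $\sum_h$ denotes $\sum_{h \in R(t_i)}$ and $E$ denotes the expectation taken with respect to random variables $t_i$ that appear in $R(t_i)$.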

This expectation is rather cumbersome to derive in general cases. Therefore, we shall make a further simplification and assume $\beta$ is a scalar and equal to 0. Under this simplifying assumption, we can show that the PMLE is asymptotically efficient. (Although this is not a very interesting case, we have considered this case because this is the only case where the asymptotic variance of the PMLE can be derived without lengthy derivations while still enabling the reader to understand essentially what kind of operation is involved in the expectation that appears in (11.2.93). For a comparison of the two estimates in more general cases, a few relevant references will be given at the end.) Under this simplification (11.2.93) is reduced to

$$
V \hat{\beta}_p = \left\{ E \sum_{i=1}^{N} \left[ \frac{\sum_h x_h^2}{\sum_h 1} - \left( \frac{\sum_h x_h}{\sum_h 1} \right)^2 \right] \right\}^{-1}. \tag{11.2.94}
$$


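We shall first evaluate (11.2.94) for the case $N = 3$ and then for the case of general $N$. If $t_1 < t_2 < t_3$, we have

$$
\sum_{i=1}^{3} \frac{\sum_h x_h^2}{\sum_h 1} = \frac{x_1^2 + x_2^2 + x_3^2}{3} + \frac{x_2^2 + x_3^2}{2} + x_3^2. \tag{11.2.95}
$$

If we change the rank order of $(t_1, t_2, t_3)$, the right-hand side of (11.2.95) will change correspondingly. But, under our stationarity (constant $\lambda$) and homogeneity ($\beta = 0$) assumptions, each one of six possible rank orderings can happen with an equal probability. Therefore we obtain

$$
E \sum_{i=1}^{3} \frac{\sum_h x_h^2}{\sum_h 1} = \sum_{i=1}^{3} x_i^2. \tag{11.2.96}
$$

Similarly, if $t_1 < t_2 < t_3$, we have

$$
\sum_{i=1}^{3} \left( \frac{\sum_h x_h}{\sum_h 1} \right)^2 = \frac{(x_1 + x_2 + x_3)^2}{3^2} + \frac{(x_2 + x_3)^2}{2^2} + x_3^2. \tag{11.2.97}
$$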

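By using an argument similar to that above, we obtain

$$
E \sum_{i=1}^{3} \left( \frac{\sum_h x_h}{\sum_h 1} \right)^2 = \frac{1}{3} \left( \frac{1}{3} + \frac{1}{2} + 1 \right) \sum_{i=1}^{3} x_i^2 + \left( \frac{1}{9} + \frac{1}{12} \right) \sum_{i=1}^{2} \sum_{j=i+1}^{3} 2 x_i x_j. \tag{11.2.98}
$$

Generalizing (11.2.96) and (11.2.98) to a general $N$, we obtain

$$
E \sum_{i=1}^{N} \frac{\sum_h x_h^2}{\sum_h 1} = \sum_{i=1}^{N} x_i^2 \tag{11.2.99}
$$

and

$$
E \sum_{i=1}^{N} \left( \frac{\sum_h x_h}{\sum_h 1} \right)^2 = \frac{1}{N} \sum_{k=1}^{N} \frac{1}{k} \sum_{i=1}^{N} x_i^2 + \frac{1}{N(N-1)} \sum_{k=1}^{N} \frac{k-1}{k} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} 2 x_i x_j. \tag{11.2.100}
$$

The coefficient on $\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} 2 x_i x_j$ can be derived as follows: There are $N!$ permutations of $N$ integers $1, 2, \ldots, N$, each of which happens with an equal probability. Let $P_k$ be the number of permutations in which a given pair of integers, say, 1 and 2, appear in the last $k$ positions. Then $P_k = \binom{k}{2} 2 [(N-2)!]$. Therefore the desired coefficient is given by $(N!)^{-1} \sum_{k=1}^{N} \left\{ \binom{k}{2} 2 [(N-2)!] / k^2 \right\}$.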

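Finally, from (11.2.94), (11.2.99), and (11.2.100), we conclude

$$
V \hat{\beta}_p = \left[ \frac{1}{N-1} \sum_{k=1}^{N} \frac{k-1}{k} \sum_{i=1}^{N} (x_i - \bar{x})^2 \right]^{-1}. \tag{11.2.101}
$$

Compare (11.2.101) with $V \hat{\beta}$ given in (11.2.92). Because

$$
\lim_{N \to \infty} \frac{1}{N-1} \sum_{k=1}^{N} \frac{k-1}{k} = 1, \tag{11.2.102}
$$

we see that Cox's PMLE is asymptotically efficient in this special case.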
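The result (11.2.101) can be confirmed for small $N$ by brute-force enumeration of all equally likely rank orders, and the limit (11.2.102) can be inspected directly. This is only an illustrative sketch; the function names are hypothetical.

```python
from itertools import permutations

def efficiency_factor(N):
    """(1/(N-1)) * sum_{k=1}^{N} (k-1)/k, the factor appearing in (11.2.101)."""
    return sum((k - 1) / k for k in range(1, N + 1)) / (N - 1)

def expected_information(x):
    """Exact expectation inside (11.2.94) at beta = 0, averaged over all rank orders."""
    N = len(x)
    total = 0.0
    count = 0
    for order in permutations(range(N)):
        for i in range(N):
            risk = [x[h] for h in order[i:]]  # risk set at the i-th smallest duration
            m = len(risk)
            total += sum(v * v for v in risk) / m - (sum(risk) / m) ** 2
        count += 1
    return total / count
```

The factor equals the ratio of $V\hat{\beta}$ in (11.2.92) to $V\hat{\beta}_p$ in (11.2.101). It is $7/12$ for $N = 3$ and approaches 1 only slowly, because $\sum_{k=1}^{N} 1/k$ grows like $\log N$.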
Kalbfleisch (1974) evaluated the asymptotic relative efficiency of the PMLE also for nonzero $\beta$ in a model that is otherwise the same as the one we have just considered. He used a Taylor expansion of (11.2.93) around $\beta = 0$. Kay (1979) extended Kalbfleisch's results to a case where $\beta$ is a two-dimensional vector. Han (1983), using a convenient representation of the asymptotic relative efficiency of the PMLE obtained by Efron (1977), evaluated it for Weibull as well as exponential models with any number of dimensions of the $\beta$ vector and with or without censoring. Cox's estimator is found to have a high asymptotic relative efficiency in most of the cases considered by these authors. However, it would be useful to study the performance of Cox's estimator in realistic situations, which are likely to occur in econometric applications.


Exercises 


1. (Section 11.1.1)
Using (11.1.5), express the unconditional mean and variance of $y^i(2)$ as a function of the mean and variance of $y^i(0)$.


2. (Section 11.1.1)
Write $L_1$ and $L_2$ defined in (11.1.10) explicitly in the special case where $P^i_{jk}(t) = P_{jk}$, $M = 2$, and $T = 2$.


3. (Section 11.1.1)
Find an example of a matrix that is not necessarily positive but for which a unique vector of equilibrium probabilities exists.


4. (Section 11.1.1)
Prove the statement following (11.1.15).


5. (Section 11.1.1)
Verify statement (11.1.23).


6. (Section 11.1.3)
Derive the asymptotic variance-covariance matrix of the MLE of $\alpha$ in the Boskin-Nold model using (11.1.38).


7. (Section 11.1.3)
The mean duration is derived in (11.1.65). Using a similar technique, derive Vt.


8. (Section 11.1.5)
Let the Markov matrix of a two-state (1 or 0) stationary homogeneous first-order Markov chain model be

[^ i _ | —A 4] 

Por Poo à dg 

where $\lambda$ is the only unknown parameter of the model. Define the following symbols:

$n_{jk}$ Number of people who were in state $j$ at time 0 and are in state $k$ at time 1

$n_{j\cdot}$ Number of people who were in state $j$ at time 0

$n_{\cdot j}$ Number of people who are in state $j$ at time 1

We are to treat $n_{j\cdot}$ as given constants and $n_{jk}$ and $n_{\cdot j}$ as random variables.

a. Supposing $n_{1\cdot} = 10$, $n_{0\cdot} = 5$, $n_{\cdot 1} = 8$, and $n_{\cdot 0} = 7$, compute the least squares estimate of $\lambda$ based on Eq. (11.1.73). Also, compute an estimate of its variance conditional on $n_{j\cdot}$.

b. Supposing $n_{11} = 7$ and $n_{01} = 1$ in addition to the data given in a, compute the MLE of $\lambda$ and an estimate of its variance conditional on $n_{j\cdot}$.


9. (Section 11.1.5)
Prove that minimizing (11.1.78) yields the asymptotically same estimator as minimizing (11.1.77).


10. (Section 11.1.5)
Write down the aggregate likelihood function (11.1.82) explicitly in the following special case: $T = 1$, $N = 5$, $r_0 = 3$, and $r_1 = 2$.


11. (Section 11.2.5)
Verify (11.2.50).


12. (Section 11.2.5)
Consider a particular individual. Suppose his hazard rate of moving from state 1 to state 2 is $\alpha_1 t + \beta_1 k$, where $t$ is the duration in state 1 and $k$ denotes the $k$th spell in state 1. The hazard rate of moving from state 2 to 1 is $\alpha_2 t + \beta_2 k$, where $t$ is the duration in state 2 and $k$ denotes the $k$th spell in state 2. Suppose he was found in state 1 at time 0. This is his first spell in state 1. (He may have stayed in state 1 prior to time 0.) Then he moved to state 2 at time $t_1$, completed his first spell in state 2, moved back to state 1 at time $t_2$, and stayed in state 1 at least until time $t_3$. Write down his contribution to the likelihood function. Assume $\alpha_1, \alpha_2 > 0$.


13. (Section 11.2.5)
Consider a homogeneous nonstationary duration model with the hazard rate

$$
\lambda(t) = \alpha t, \quad \alpha > 0.
$$

Supposing we observe $n$ completed spells of duration $t_1, t_2, \ldots, t_n$ and $N - n$ censored spells of duration $t_{n+1}, t_{n+2}, \ldots, t_N$, derive the MLE of $\alpha$ and its asymptotic variance.


14. (Section 11.2.5)
Let $F(t \mid \lambda) = 1 - e^{-\lambda t}$ where $\lambda$ is a random variable distributed with density $g(\cdot)$. Define $\lambda(t) = f(t)/[1 - F(t)]$, where $F(t) = E_{\lambda} F(t \mid \lambda)$ and $f(t) = dF/dt$. Show $d\lambda(t)/dt < 0$ (cf. Flinn and Heckman, 1982).


Appendix 1 
Useful Theorems in Matrix Analysis 


The theorems listed in this appendix are the ones especially useful in econo- 
metrics. All matrices are assumed to be real. Proofs for many of these 
theorems can be found in Bellman (1970). 


1. For any square matrix $A$ with distinct characteristic roots, there exists a nonsingular matrix $P$ such that $PAP^{-1} = \Lambda$, where $\Lambda$ is a diagonal matrix with the characteristic roots of $A$ in the diagonal. If the characteristic roots are not distinct, $\Lambda$ takes the Jordan canonical form (see Bellman, p. 198).


2. The determinant of a matrix is the product of its characteristic roots. 


3. For any symmetric matrix A, there exists an orthogonal matrix H such 
that $H'H = I$ and $H'AH = \Lambda$, where $\Lambda$ is a diagonal matrix with the 
characteristic roots (which are real) of A in the diagonal. The ith column of H is called 
the characteristic vector of A corresponding to the characteristic root of A that 
is the ith diagonal element of $\Lambda$. 
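As a numerical illustration of Theorems 2 and 3, the following sketch diagonalizes a small symmetric matrix with NumPy. The matrix entries and all variable names are hypothetical, chosen only for the example; it is a check under those assumptions, not a proof.

```python
import numpy as np

# Hypothetical symmetric matrix, chosen only for illustration.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# eigh is intended for symmetric matrices: it returns real roots in
# ascending order and an orthogonal matrix H whose columns are the
# corresponding characteristic vectors.
roots, H = np.linalg.eigh(A)
Lam = np.diag(roots)

h_is_orthogonal = np.allclose(H.T @ H, np.eye(3))            # H'H = I
a_is_diagonalized = np.allclose(H.T @ A @ H, Lam)            # H'AH = Lambda
det_is_product = np.isclose(np.linalg.det(A), roots.prod())  # Theorem 2
print(h_is_orthogonal, a_is_diagonalized, det_is_product)
```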


4. For a symmetric matrix A, the following statements are equivalent: 


(i) A is positive definite. (Write A > 0.) 

(ii) x’Ax is positive for any nonzero vector x. 
(iii) Principal minors of A are all positive. 
(iv) Characteristic roots of A are all positive. 


The above is true if we change the word positive to nonnegative. 
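The equivalences in Theorem 4 can be spot-checked numerically. The sketch below is only illustrative: the function name `pd_checks` and the test matrices are hypothetical, criterion (ii) is sampled on random vectors rather than proved, and criterion (iii) is checked on the leading principal minors, which suffices for positive definiteness. For the nonnegative version, all principal minors, not only the leading ones, would have to be examined.

```python
import numpy as np

def pd_checks(A, draws=1000, seed=0):
    """Evaluate criteria (ii)-(iv) of Theorem 4 for a symmetric matrix A."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    # (ii) x'Ax > 0, sampled on random nonzero vectors (a spot check only).
    xs = rng.standard_normal((draws, n))
    quad = np.einsum("ij,jk,ik->i", xs, A, xs)
    # (iii) leading principal minors.
    minors = [np.linalg.det(A[:k, :k]) for k in range(1, n + 1)]
    # (iv) characteristic roots.
    roots = np.linalg.eigvalsh(A)
    return (bool(np.all(quad > 0)),
            bool(all(m > 0 for m in minors)),
            bool(np.all(roots > 0)))

definite = np.array([[2.0, 1.0], [1.0, 2.0]])     # roots 1 and 3
indefinite = np.array([[1.0, 2.0], [2.0, 1.0]])   # roots -1 and 3
print(pd_checks(definite), pd_checks(indefinite))
```

The three criteria agree on both matrices, as the theorem requires.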


5. For any matrices A and B, the nonzero characteristic roots of AB and BA 
are the same, whenever both AB and BA are defined. 


6. tr AB = tr BA. 


7. For any square matrix A, tr A is equal to the sum of its characteristic 
roots. 
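Theorems 5 through 7 can be illustrated with a pair of rectangular matrices, for which AB and BA have different sizes. The matrices below are hypothetical and were chosen so that all characteristic roots are real; the threshold used to discard numerically zero roots is likewise an assumption of this sketch.

```python
import numpy as np

# Hypothetical conformable matrices: A is 2 x 3 and B is 3 x 2.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
B = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 3.0]])

AB = A @ B   # 2 x 2
BA = B @ A   # 3 x 3, singular, so one of its roots is zero

roots_AB = np.sort(np.linalg.eigvals(AB).real)
roots_BA = np.sort(np.linalg.eigvals(BA).real)
nonzero_BA = roots_BA[np.abs(roots_BA) > 1e-8]

same_nonzero_roots = np.allclose(roots_AB, nonzero_BA)           # Theorem 5
same_trace = np.isclose(np.trace(AB), np.trace(BA))              # Theorem 6
trace_is_root_sum = np.isclose(np.trace(BA), roots_BA.sum())     # Theorem 7
print(same_nonzero_roots, same_trace, trace_is_root_sum)
```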


8. Let A, B be symmetric matrices of the same size. A necessary and 
sufficient condition that there exists an orthogonal matrix H such that both 
H'AH and H'BH are diagonal is that AB = BA. 


9. Any nonnegative (semipositive) definite matrix A can be written as 
A=TT’, where T is a lower triangular matrix with nonnegative diagonal 
elements. 
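The factorization in Theorem 9 is what numerical libraries call the Cholesky decomposition. The sketch below uses `numpy.linalg.cholesky`, which assumes a positive definite (not merely nonnegative definite) argument; the matrix is hypothetical and was picked so that the factor has integer entries.

```python
import numpy as np

# Hypothetical positive definite matrix.
A = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 3.0],
              [2.0, 3.0, 6.0]])

T = np.linalg.cholesky(A)   # lower triangular factor with A = TT'

is_lower_triangular = np.allclose(T, np.tril(T))
reproduces_A = np.allclose(T @ T.T, A)
print(T)
```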


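10. Let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the characteristic roots of a symmetric matrix A 
in descending order ($\lambda_1$ being the largest). Then 

$$\lambda_1 = \max_{x} \frac{x'Ax}{x'x} \quad \left( = \frac{x_1'Ax_1}{x_1'x_1} \right),$$

$$\lambda_2 = \max_{x'x_1 = 0} \frac{x'Ax}{x'x} \quad \left( = \frac{x_2'Ax_2}{x_2'x_2} \right),$$

$$\lambda_3 = \max_{\substack{x'x_1 = 0 \\ x'x_2 = 0}} \frac{x'Ax}{x'x} \quad \text{and so on.}$$
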
11. Let A and B be symmetric matrices (n × n) with B nonnegative definite. 
Then $\mu_i(A + B) \geq \lambda_i(A)$, i = 1, 2, . . . , n, where λ's and μ's are the 
characteristic roots in descending order. The strict inequality holds if B is 
positive definite. 
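Theorems 10 and 11 lend themselves to a quick numerical check. In the sketch below the matrices, the helper `rayleigh`, and the number of random trial vectors are all hypothetical choices; the random search only illustrates that no quotient x'Ax/x'x exceeds the largest root, and it does not replace the proof.

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])
c = np.array([[1.0], [0.0], [1.0]])
B = c @ c.T                                # nonnegative definite, rank 1

lam = np.linalg.eigvalsh(A)[::-1]          # roots of A, descending
mu = np.linalg.eigvalsh(A + B)[::-1]       # roots of A + B, descending
roots_do_not_decrease = bool(np.all(mu >= lam - 1e-12))   # Theorem 11

def rayleigh(x):
    """The quotient x'Ax / x'x of Theorem 10."""
    return float(x @ A @ x / (x @ x))

# Theorem 10: the quotient is bounded by the largest root and attains
# it at the corresponding characteristic vector x1.
rng = np.random.default_rng(0)
best_trial = max(rayleigh(x) for x in rng.standard_normal((2000, 3)))
x1 = np.linalg.eigh(A)[1][:, -1]
print(lam[0], best_trial, rayleigh(x1))
```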


12. $\begin{vmatrix} A & B \\ C & D \end{vmatrix} = |D| \cdot |A - BD^{-1}C|$ if $|D| \neq 0$. 


13. $\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} E^{-1} & -E^{-1}BD^{-1} \\ -D^{-1}CE^{-1} & F^{-1} \end{bmatrix}$, 
where $E = A - BD^{-1}C$, $F = D - CA^{-1}B$, $E^{-1} = A^{-1} + A^{-1}BF^{-1}CA^{-1}$, 
and $F^{-1} = D^{-1} + D^{-1}CE^{-1}BD^{-1}$. 


14. Let A be an n × r matrix of rank r. A matrix of the form 
$P = A(A'A)^{-1}A'$ is called a projection matrix and is of special importance in 
statistics. 

(i) $P = P' = P^2$. (Hence, P is symmetric and idempotent.) 
(ii) rank (P) = r. 
(iii) Characteristic roots of P consist of r ones and n − r zeros. 
(iv) If x = Ac for some vector c, then Px = x. (Hence the word 
projection.) 
(v) $M = I - A(A'A)^{-1}A'$ is also idempotent, with rank n − r, its 
characteristic roots consisting of n − r ones and r zeros, and if x = Ac, 
then Mx = 0. 
(vi) P can be written as G'G, where GG' = I, or as 
$v_1v_1' + v_2v_2' + \cdots + v_rv_r'$, where $v_i$ is a vector and r = rank (P). 
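The properties of the projection matrix P = A(A'A)^{-1}A' and of M = I − P are easy to confirm on a small example. The regressor matrix and the vector c below are hypothetical; the sketch uses the fact that the rank of an idempotent matrix equals its trace.

```python
import numpy as np

# Hypothetical n x r matrix of full column rank (n = 4, r = 2).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
n, r = A.shape

P = A @ np.linalg.solve(A.T @ A, A.T)   # A(A'A)^{-1}A' without an explicit inverse
M = np.eye(n) - P

c = np.array([2.0, -1.0])
x = A @ c                               # a vector in the column space of A

checks = {
    "symmetric": np.allclose(P, P.T),
    "idempotent": np.allclose(P @ P, P),
    "trace_equals_rank": np.isclose(np.trace(P), r),
    "roots_are_zeros_and_ones": np.allclose(np.linalg.eigvalsh(P), [0, 0, 1, 1]),
    "Px_equals_x": np.allclose(P @ x, x),
    "M_idempotent": np.allclose(M @ M, M),
    "Mx_is_zero": np.allclose(M @ x, 0),
}
print(checks)
```

Forming P through `solve` rather than an explicit inverse is a numerical convenience only; the algebra is that of the definition.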


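15. Let A be an n × r matrix of rank r. Partition A as $A = (A_1, A_2)$ such that 
$A_1$ is $n \times r_1$ and $A_2$ is $n \times r_2$. If we define $\bar{A}_2 = [I - A_1(A_1'A_1)^{-1}A_1']A_2$, then 
we have 

$$A(A'A)^{-1}A' = A_1(A_1'A_1)^{-1}A_1' + \bar{A}_2(\bar{A}_2'\bar{A}_2)^{-1}\bar{A}_2'.$$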


16. If A is positive definite and B is symmetric, there exists a nonsingular 
matrix C such that C'AC = I, C'BC = D, where D is diagonal and the 
diagonal elements of D are the roots of $|B - \lambda A| = 0$. 


17. Let A and B be symmetric nonsingular matrices of the same size. Then 
$A \geq B \geq 0$ implies $B^{-1} \geq A^{-1}$. 


18. Let A be any nonsingular matrix. The characteristic roots of $A^{-1}$ are the 
reciprocals of the characteristic roots of A. 


19. Let A and B be nonsingular matrices of the same size such that A + B 
and $A^{-1} + B^{-1}$ are nonsingular. Then 

(i) $(A + B)^{-1} = A^{-1}(A^{-1} + B^{-1})^{-1}B^{-1}$ 
(ii) $A^{-1} - (A + B)^{-1} = A^{-1}(A^{-1} + B^{-1})^{-1}A^{-1}$. 


20. Let X be a matrix with full column rank. Then 
$(I + XX')^{-1} = I - X(I + X'X)^{-1}X'$. 
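The inverse identities of Theorems 19 and 20 can be verified numerically. The matrices below are hypothetical, and the identities being tested are written out in the comments so that the sketch can be read on its own.

```python
import numpy as np

inv = np.linalg.inv

# Hypothetical nonsingular matrices with A + B and A^{-1} + B^{-1} nonsingular.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
B = np.array([[2.0, 0.5], [0.0, 1.0]])
S = inv(inv(A) + inv(B))                      # (A^{-1} + B^{-1})^{-1}

# 19(i):  (A + B)^{-1} = A^{-1} (A^{-1} + B^{-1})^{-1} B^{-1}
rule_19_i = np.allclose(inv(A + B), inv(A) @ S @ inv(B))
# 19(ii): A^{-1} - (A + B)^{-1} = A^{-1} (A^{-1} + B^{-1})^{-1} A^{-1}
rule_19_ii = np.allclose(inv(A) - inv(A + B), inv(A) @ S @ inv(A))

# 20: (I + XX')^{-1} = I - X (I + X'X)^{-1} X'
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
n, k = X.shape
lhs = inv(np.eye(n) + X @ X.T)
rhs = np.eye(n) - X @ inv(np.eye(k) + X.T @ X) @ X.T
rule_20 = np.allclose(lhs, rhs)
print(rule_19_i, rule_19_ii, rule_20)
```

Identity 20 is useful in practice because it replaces the inversion of an n × n matrix by that of a k × k matrix, where k is the number of columns of X.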
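21. Rules for Matrix Differentiation. $\xi$ is an element of X. Every other 
symbol is a matrix. $\| \cdot \|$ denotes the absolute value of the determinant. 

(i) $\dfrac{d}{d\xi} XY = \dfrac{dX}{d\xi} Y$ 

(ii) $\dfrac{d}{d\xi} \log \|X\| = \operatorname{tr}\left( X^{-1} \dfrac{dX}{d\xi} \right)$ 

(iii) $\dfrac{d}{dA} \log \|A\| = (A')^{-1}$ 

(iv) $\dfrac{d}{d\xi} X^{-1} = -X^{-1} \dfrac{dX}{d\xi} X^{-1}$ 

(v) $\dfrac{d}{d\xi} \operatorname{tr} X^{-1}Y = -\operatorname{tr}\left( X^{-1} \dfrac{dX}{d\xi} X^{-1} Y \right)$ 

(vi) $\dfrac{d}{dA} \operatorname{tr} A^{-1}B = -(A^{-1}BA^{-1})'$ 

(vii) $\dfrac{d}{dA} \operatorname{tr} AB = B'$ 

(viii) $\dfrac{d}{dA} \operatorname{tr} ABA'C = CAB + C'AB'$ 

(ix) $\dfrac{d}{dA} \operatorname{tr} A'BAC = B'AC' + BAC$ 
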
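Matrix differentiation rules of this kind can be checked against finite differences. The sketch below tests three of them: d log ||A|| / dA = (A')^{-1}, d tr AB / dA = B', and d tr ABA'C / dA = CAB + C'AB'. It assumes the convention that the (i, j) element of a derivative with respect to A is the partial derivative with respect to a_ij; the helper `numerical_gradient`, the step size, and the matrices are hypothetical.

```python
import numpy as np

def numerical_gradient(f, A, h=1e-6):
    """Central-difference partials: entry (i, j) approximates df/da_ij."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = h
            G[i, j] = (f(A + E) - f(A - E)) / (2.0 * h)
    return G

# Hypothetical nonsymmetric matrices, so that transposes matter.
A = np.array([[2.0, 1.0], [0.5, 3.0]])
B = np.array([[1.0, 2.0], [3.0, 4.0]])
C = np.array([[0.0, 1.0], [2.0, 1.0]])

log_det_rule = np.allclose(
    numerical_gradient(lambda a: np.log(abs(np.linalg.det(a))), A),
    np.linalg.inv(A).T, atol=1e-6)
trace_ab_rule = np.allclose(
    numerical_gradient(lambda a: np.trace(a @ B), A),
    B.T, atol=1e-6)
trace_quadratic_rule = np.allclose(
    numerical_gradient(lambda a: np.trace(a @ B @ a.T @ C), A),
    C @ A @ B + C.T @ A @ B.T, atol=1e-6)
print(log_det_rule, trace_ab_rule, trace_quadratic_rule)
```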
22. Kronecker Products. Let $A = \{a_{ij}\}$ be a K × L matrix and let B be an 
M × N matrix. Then the Kronecker product $A \otimes B$ is a KM × LN matrix 
defined by 

$$A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1L}B \\ a_{21}B & a_{22}B & \cdots & a_{2L}B \\ \vdots & \vdots & & \vdots \\ a_{K1}B & a_{K2}B & \cdots & a_{KL}B \end{bmatrix}.$$

(i) $(A \otimes B)(C \otimes D) = AC \otimes BD$ if AC and BD can be defined. 
(ii) $\operatorname{tr}(A \otimes B) = \operatorname{tr} A \cdot \operatorname{tr} B$ if A and B are square. 
(iii) Let A be a K × K matrix with characteristic roots 
$\lambda_1, \lambda_2, \ldots, \lambda_K$ and let B be an M × M matrix with characteristic 
roots $\mu_1, \mu_2, \ldots, \mu_M$. Then the KM characteristic roots of $A \otimes B$ 
are $\lambda_i\mu_j$, i = 1, 2, . . . , K, j = 1, 2, . . . , M. 
(iv) Let A and B be as in (iii). Then 
$|A \otimes B| = |A|^M \cdot |B|^K$. 
(v) Let A, B, and C be matrices of appropriate sizes such that ABC can 
be defined. Suppose A has L columns and write the L columns as 
$A = (A_1, A_2, \ldots, A_L)$. Define $\operatorname{vec}(A) = (A_1', A_2', \ldots, A_L')'$. 
Then 

$$\operatorname{vec}(ABC) = (C' \otimes A) \operatorname{vec}(B).$$
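The Kronecker product rules can be tried out with `numpy.kron`. In the sketch below the matrices are hypothetical; A and B are taken symmetric only so that their characteristic roots are real and can be sorted, and `vec` stacks columns, matching the definition in (v).

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])            # K x K with K = 2
B = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [1.0, 0.0, 3.0]])                   # M x M with M = 3
C = np.array([[1.0, 2.0], [0.0, 1.0]])
D = np.eye(3) + 1.0
K, M = A.shape[0], B.shape[0]
AkB = np.kron(A, B)

mixed_product = np.allclose(AkB @ np.kron(C, D), np.kron(A @ C, B @ D))   # (i)
trace_rule = np.isclose(np.trace(AkB), np.trace(A) * np.trace(B))         # (ii)
lam, mu = np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)
roots_rule = np.allclose(np.sort(np.outer(lam, mu).ravel()),
                         np.linalg.eigvalsh(AkB))                         # (iii)
det_rule = np.isclose(np.linalg.det(AkB),
                      np.linalg.det(A) ** M * np.linalg.det(B) ** K)      # (iv)

def vec(X):
    """Stack the columns of X into one vector."""
    return X.reshape(-1, order="F")

# (v) with rectangular matrices: P is 2 x 3, Q is 3 x 2, R is 2 x 4.
P = np.arange(6.0).reshape(2, 3)
Q = np.arange(6.0, 12.0).reshape(3, 2)
R = np.arange(8.0).reshape(2, 4)
vec_rule = np.allclose(vec(P @ Q @ R), np.kron(R.T, P) @ vec(Q))
print(mixed_product, trace_rule, roots_rule, det_rule, vec_rule)
```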


23. Hadamard Products. (Minc and Marcus, 1964, p. 120.) Let $A = \{a_{ij}\}$ 
and $B = \{b_{ij}\}$ be matrices of the same size. Then the Hadamard product A * B 
is defined by $A * B = \{a_{ij}b_{ij}\}$. 

(i) Let A and B be both n × n. Then $A * B = S'(A \otimes B)S$, where S is the 
$n^2 \times n$ matrix the ith column of which has 1 in its $[(i - 1)n + i]$th position 
and 0 elsewhere. 
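The selection matrix S that links the Hadamard and Kronecker products can be built directly from its definition. The sketch below is illustrative, with hypothetical matrices; note that the position [(i − 1)n + i] in one-based indexing becomes row i·n + i in zero-based indexing.

```python
import numpy as np

n = 3
A = np.arange(1.0, 10.0).reshape(n, n)
B = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 4.0]])

# Column i of S has a single 1 in (zero-based) row i*n + i.
S = np.zeros((n * n, n))
for i in range(n):
    S[i * n + i, i] = 1.0

hadamard = A * B                                 # elementwise product
via_kronecker = S.T @ np.kron(A, B) @ S
identity_holds = np.allclose(hadamard, via_kronecker)
print(identity_holds)
```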


Appendix 2 
Distribution Theory 


The theorems listed in this appendix, as well as many other results concerning 
the distribution of a univariate continuous random variable, can be found in 
Johnson and Kotz (1970a,b). "Rao" stands for Rao (1973) and “Plackett” for 
Plackett (1960). 


1. Chi-Square Distribution. (Rao, p. 166.) Let an n-component random 
vector z be distributed as N(0, I). Then the distribution of z'z is called the 
chi-square distribution with n degrees of freedom. Symbolically we write 
$z'z \sim \chi_n^2$. Its density is given by 

$$f(x) = 2^{-n/2}\,[\Gamma(n/2)]^{-1} e^{-x/2} x^{n/2 - 1},$$

where $\Gamma(p) = \int_0^\infty \lambda^{p-1} e^{-\lambda}\, d\lambda$ is called a gamma function. Its mean is n and its 
variance 2n. 

2. (Rao, p. 186.) Let z ~ N(0, I) and let A be a symmetric and idempotent 
matrix with rank n. Then $z'Az \sim \chi_n^2$. 
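A small simulation makes this result concrete: for a symmetric idempotent matrix of rank r, the quadratic form z'Az should have mean r and variance 2r. The dimensions, the seed, the number of draws, and the way the idempotent matrix is built are all assumptions of this sketch, and sample moments only approximate the theoretical ones.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, draws = 5, 2, 200_000

# A symmetric idempotent matrix of rank r, built as a projection matrix.
X = rng.standard_normal((m, r))
A = X @ np.linalg.solve(X.T @ X, X.T)

Z = rng.standard_normal((draws, m))        # each row is one draw of z ~ N(0, I)
q = np.einsum("ij,jk,ik->i", Z, A, Z)      # z'Az for every draw

sample_mean = q.mean()   # theoretical value: r
sample_var = q.var()     # theoretical value: 2r
print(sample_mean, sample_var)
```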

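3. Student's t Distribution. (Rao, p. 170.) If z ~ N(0, 1) and $w \sim \chi_n^2$ and if z 
and w are independent, $n^{1/2}zw^{-1/2}$ has the Student's t distribution with n 
degrees of freedom, for which the density is given by 

$$f(x) = \left[ \sqrt{n}\, B\!\left( \tfrac{1}{2}, \tfrac{n}{2} \right) \right]^{-1} \left( 1 + \frac{x^2}{n} \right)^{-(n+1)/2},$$

where 

$$B(\gamma, \delta) = \frac{\Gamma(\gamma)\Gamma(\delta)}{\Gamma(\gamma + \delta)}, \quad \text{beta function.}$$

Symbolically we write 

$$\left( \frac{n}{w} \right)^{1/2} z \sim S_n.$$
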
4. F Distribution. (Rao, p. 167.) If $w_1 \sim \chi_{n_1}^2$ and $w_2 \sim \chi_{n_2}^2$ and if $w_1$ and $w_2$ 
are independent, $n_1^{-1}w_1n_2w_2^{-1}$ has the F distribution with $n_1$ and $n_2$ degrees of 
freedom, for which the density is given by 

$$f(x) = \frac{(n_1/n_2)^{n_1/2}\, x^{n_1/2 - 1}}{B(n_1/2,\, n_2/2)\,(1 + n_1x/n_2)^{(n_1 + n_2)/2}}.$$

Symbolically we write 

$$\frac{w_1/n_1}{w_2/n_2} \sim F(n_1, n_2).$$


5. (Plackett, p. 24.) Let z ~ N(0, I). Let A and B be symmetric matrices. 
Then z' Az is independent of z' Bz if and only if AB = 0. 


6. (Plackett, p. 30.) Let z ~ N(0, I). Then c'z and z'Az are independent if 
and only if Ac = 0. 


7. (Plackett, p. 30.) Let z ~ N(0, I) and let A be a symmetric matrix. Then 
z'Az/z'z and z'z are independent. 


Notes 


1. Classical Least Squares Theory 


1. This statement is true only when $x_t^*$ is a scalar. If y, $x_1$, and $x_2$ are scalar 
dichotomous random variables, we have $E(y|x_1, x_2) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2$, where the 
β's are appropriately defined. 

2. The β that appears in (1.2.1) denotes the domain of the function S and hence, 
strictly speaking, should be distinguished from the parameter β, which is unknown and 
yet can be regarded as fixed in value. To be precise, therefore, we should use two 
different symbols; however, we shall use the same symbol to avoid complicating the 
notation, so the reader must judge from the context which of the two meanings the 
symbol conveys. 

3. Again, the warning of note 2 is in order here. The β and $\sigma^2$ that appear in the 
likelihood function represent the domain of the function whereas those that appear in 
Model 1 are unknown fixed values of the parameters — so-called true values. 

4. The Cramér-Rao lower bound is discussed, for example, by Cox and Hinkley 
(1974), Bickel and Doksum (1977), Rao (1973), and Zacks (1971), roughly in ascend- 
ing order of sophistication. 

5. Good references for Bayesian statistics are Zellner (1971) and Box and Tiao 
(1973). For a quick review, Lindley (1972) is useful. 

6. Even in this situation some people prefer using the t test. The procedure is also 
asymptotically correct, but there is no theoretical reason for the t test to be superior to 
the standard normal test in nonnormal cases. The same remark applies to the F test 
discussed later. See Pearson and Please (1975) for a discussion of the properties of the t 
and F tests in nonnormal cases. See White and MacDonald (1980) for tests of nonnor- 
mality using the least squares residuals. 

7. Chow (1960) considered the case where $T_2 < K^*$ and indicated how the 
subsequent analysis can be modified to incorporate this case. 

8. This formula can easily be generalized to the situation where there are n regimes 
(n > 2). Simply combine n equations like (1.5.25) and (1.5.27) and calculate the sum 
of squared residuals from each equation. Then the numerator chi-square has 
$(n - 1)K^*$ degrees of freedom and the denominator chi-square has $\sum_{i=1}^{n} T_i - nK^*$ 
degrees of freedom. 

9. These two tests can easily be generalized to the case of n regimes mentioned in 
note 8. 
10. If we have cross-section data, p refers to a cross-section unit not included in the 
sample. 

11. Theorem 1.6.1 is true even if we let C depend on $x_p$. Then the two theorems are 
equivalent because any d satisfying (1.6.5) can be written as $Cx_p$ for some C such that 
C'X = I. 


2. Recent Developments in Regression Analysis 


1. A much more detailed account of this topic can be found in Amemiya (1980a). 

2. We use the term estimator here, but all the definitions and the results of this 
subsection remain the same if we interpret d as any decision function mapping Y 
into Θ. 

3. We assume that the losses do not depend on the parameters of the models. 
Otherwise, the choice of models and the estimation of the parameters cannot be 
sequentially analyzed, which would immensely complicate the problem. However, we 
do not claim that our assumption is realistic. We adopt this simplification for the 
purpose of illustrating certain basic ideas. 

4. For ways to get around this problem, see, for example, Akaike (1976) and 
Schwarz (1978). 

5. See Thisted (1976) for an excellent survey of this topic, as well as for some original 
results. More recent surveys are given by Draper and Van Nostrand (1979) and Judge 
and Bock (1983). 

6. The matrix $H_1\Lambda_1^{-1}H_1'$ is sometimes referred to as the Moore-Penrose generalized 
inverse of X'X and is denoted by $(X'X)^+$. See Rao (1973, p. 24) for more discussion of 
generalized inverses. 

7. This question was originally posed and solved by Silvey (1969). 

8. Although Sclove considers the random coefficients model and the prediction 
(rather than estimation) of the regression vector, there is no essential difference be- 
tween his approach and the Bayesian estimation. 

9. What follows simplifies the derivation of Sclove et al. (1972). 

10. See Section 4.6.1. In Chapter 3 we shall discuss large sample theory and make 
the meaning of the term asymptotically more precise. For the time being, the reader 
should simply interpret it to mean “approximately when T is large." 

11. See the definition of probability limit in Chapter 3. Loosely speaking, the state- 
ment means that when T is large, s is close to 59 with probability close to 1. 


3. Large Sample Theory 


1. Representative textbooks are, in a roughly increasing order of difficulty, Hoel 
(1971); Freund (1971); Mood, Graybill, and Boes (1974); Cox and Hinkley (1974), and 
Bickel and Doksum (1977). 

2. For a more complete study of the subject, the reader should consult Chung (1974) 
or Loéve (1977), the latter being more advanced than the former. Rao (1973), which is 
an excellent advanced textbook in mathematical statistics, also gives a concise review 
of the theory of probability and random variables. 

3. If a set function satisfies only (i) and (iii) of Definition 3.1.2, it is called a 
measure. In this case the triplet defined in Definition 3.1.3 is called a measure space. 
Thus the theory of probability is a special case of measure theory. (A standard textbook 
for measure theory is Halmos, 1950.) 

4. If we have a measure space, then a function that satisfies the condition of Defini- 
tion 3.1.4 is called a measurable function. Thus a random variable is a function 
measurable with respect to the probability measure. 

5. Lebesgue measure can be defined also for certain non-Borel sets. The Lebesgue 
measure defined only for Borel sets is called Borel measure. 

6. For Y to be a random variable, h must satisfy the condition $\{\omega \mid h[X(\omega)] \leq y\} \in \mathcal{A}$ 
for every y. Such a function is said to be Borel-measurable. A function that is continuous 
except for a countable number of discontinuities is Borel-measurable. 

7. The following is a simple example in which the Riemann-Stieltjes integral does 
not exist. Suppose F(x) = 0 for $a \leq x \leq c$ and F(x) = 1 for $c < x \leq b$ and suppose 
h(x) = F(x). Then, depending on the point x* we choose in an interval that contains c, 
$S_n$ is equal to either 1 or 0. This is a weakness of the Riemann-Stieltjes integral. In this 
example the Lebesgue-Stieltjes integral exists and is equal to 0. However, we will 
not go into this matter further. The reader may consult references cited in Note 2 to 
Chapter 3. 

8. Sometimes we also say $X_n$ converges to X almost everywhere or with probability 
one. Convergence in probability and convergence almost surely are sometimes re- 
ferred to as weak convergence and strong convergence, respectively. 

9. Between M and a.s., we cannot establish a definite logical relationship without 
further assumptions. 

10. The law of large numbers implied by Theorem 3.2.1 (Chebyshev) can be slightly 
generalized so as to do away with the requirement of a finite variance. Let $\{X_t\}$ be 
independent and suppose $E|X_t|^{1+\delta} < M$ for some $\delta > 0$ and some $M < \infty$. Then 
$\bar{X}_n - E\bar{X}_n \xrightarrow{P} 0$. This is called Markov's law of large numbers. 

11. The principal logarithm of a complex number $re^{i\theta}$ is defined as $\log r + i\theta$. 

12. It seems that for most practical purposes the weak consistency of an estimator is 
all that a researcher would need, and it is not certain how much more practical benefit 
would result from proving strong consistency in addition. 

13. Lai, Robbins, and Wei (1978) proved that if $\{u_t\}$ are assumed to be independent 
in Model 1 and if the conditions of Theorem 3.5.1 are met, the least squares estimator 
is strongly consistent. Furthermore, the homoscedasticity assumption can be relaxed, 
provided the variances of $\{u_t\}$ are uniformly bounded from above. 


4. Asymptotic Properties of Extremum Estimators 


1. In the proof of Theorem 4.1.1, continuity of $Q_T(\theta)$ is used only to imply 
continuity of $Q(\theta)$ and to make certain the measurability of $\hat{\theta}_T$. Therefore we can modify this 
theorem in such a way that we assume continuity of $Q(\theta)$ but not of $Q_T(\theta)$ and define 
convergence in probability in a more general way that does not require measurability 
of $\hat{\theta}_T$. This is done in Theorem 9.6.1. 

Also note that the proof of Theorem 4.1.1 can easily be modified to show that if the 
convergence in the sense of (i) holds in assumption C, $\hat{\theta}_T$ converges to $\theta_0$ almost surely. 

2. Strictly speaking, (4.1.10) is defined only for those T's for which $\Theta_T$ is nonempty. 
However, the probability that $\Theta_T$ is nonempty approaches 1 as T goes to infinity 
because of Assumption C of Theorem 4.1.2. As an aside, Jennrich (1969) proved that 
$\hat{\theta}_T$ is a measurable function. 

3. This proof is patterned after the proof of a similar theorem by Jennrich (1969). 

4. Note that if $z_T$ is merely defined as a random variable with mean $\mu(\theta)$ and 
variance-covariance matrix $\Sigma(\theta)$, the minimization of the quadratic form does not 
even yield a consistent estimator of θ. 

5. The term second-order efficiency is sometimes used to denote a related but 
different concept (see Pfanzagl, 1973). 

6. If the error term is multiplicative as in $Q = \beta_1K^{\beta_2}L^{\beta_3}e^{u}$, the log transformation 
reduces it to a linear regression model. See Bodkin and Klein (1967) for the estimation 
of both models. 

7. The methods of proof used in this section and Section 4.3.3 are similar to those of 
Jennrich (1969). 

8. Because a function continuous on a compact set is uniformly continuous, we can 
assume without loss of generality that $f_t(\beta)$ is uniformly continuous in $\beta \in N$. 

9. The nonsingularity is not needed here but is assumed as it will be needed later. 

10. Note that in the special case where the constraint $h(\beta) = 0$ is linear and can be 
written as $Q'\beta = c$, (4.5.21) is similar to (4.3.32). We cannot unequivocally determine 
whether the chi-square approximation of the distribution of (4.5.21) is better or worse 
than the F approximation of the distribution of (4.3.32). 

11. If the sample is (1, 2, 3), 2 is the unique median. If the sample is (1, 2, 3, 4), any 
point in the closed interval [2, 3] may be defined as a median. The definition (4.6.3) 
picks 2 as the median. If f(x) > 0 in the neighborhood of x= M, this ambiguity 
vanishes as the sample size approaches infinity. 

12. The second term of the right-hand side of (4.6.10) does not affect the 
minimization but is added so that plim $T^{-1}S_T$ can be evaluated without assuming the existence 
of the first absolute moment of $Y_t$. This idea originates in Huber (1965). 

13. Alternative methods of proof of asymptotic normality can be found in Bassett 
and Koenker (1978) and in Amemiya (1982a). 


5. Time Series Analysis 


1. We are using the approximation sign ≅ to mean that most elements of the 
matrices of both sides are equal. 

2. The subscript p denotes the order of the autoregression. The size of the matrix 
should be inferred from the context. 
3. Almon (1965) suggested an alternative method of computing $\hat{\beta}$ based on 
Lagrangian interpolation polynomials. Cooper (1972) stated that although the method 
discussed in the text is easier to understand, Almon's method is computationally 
superior. 


6. Generalized Least Squares Theory 


1. If rank (H;X) = K, f is uniquely determined by (6.1.2). 

2. Farebrother (1980) presented the relevant tables for the case in which there is no 
intercept. 

3. Breusch (1978) and Godfrey (1978) showed that Durbin's test is identical to 
Rao’s score test (see Section 4.5.1). See also Breusch and Pagan (1980) for the Lagrange 
multiplier test closely related to Durbin's test. 

4. In the special case in which N = 2 and β changes with i (so that we must estimate 
both $\beta_1$ and $\beta_2$), statistic (6.5.9) is a simple transformation of the F statistic (1.5.44). In 
this case (1.5.44) is preferred because the distribution given there is exact. 

5. It is not essential to use the unbiased estimators 5? here. If 62 are used, the 
distribution of j is only trivially modified. — 

6. In either FGLS or MLE we can replace fJ, by f, without affecting the asymptotic 
distribution. 

7. Hildreth and Houck suggested another estimator (Z/MZ) !Z/&?, which is the 
instrumental variables estimator applied to (6.5.23) using Z as the instrumental vari- 
ables. Using inequality (6.5.29), we can show that this estimator is also asymptotically 
less efficient than à. See Hsiao (1975) for an interesting derivation of the two estima- 
tors of Hildreth and Houck from the MINQUE principle of Rao (1970). Froehlich 
(1973) suggested FGLS applied to (6.5.23). By the same inequality, the estimator can 
be shown to be asymptotically less efficient than à. Froehlich reported a Monte Carlo 
study that compared the FGLS subject to the nonnegativity of the variances with 
several other estimators. A further Monte Carlo study has been reported by Dent and 
Hildreth (1977). 

8. The symbol l here denotes an NT-vector of ones. The subscripts will be omitted 
whenever the size of l is obvious from the context. The same is true for the identity 
matrix. 

9. For the consistency of fo we need both N and T to go to ~. For further discus- 
sion, see Anderson and Hsiao (1981, 1982). 

10. The presence of the vector β in a variance term would cause a problem in the 
derivation of the asymptotic results if T were allowed to go to ∞, but in the 
Lillard-Weiss model, as in most econometric panel-data studies, T is small and N is large. 


7. Linear Simultaneous Equations Models 


1. More general constraints on the elements of Γ and B have been discussed by 
Fisher (1966) and Hsiao (1983). 
2. For alternative definitions of identification, see Hsiao (1983). 

3. Identification of the structural parameters may be possible because of constraints 
on the variance-covariance matrix of the errors (see Fisher, 1966, or Hsiao, 1983). 

4. For an alternative derivation, see Koopmans and Hood (1953). 

5. Basmann (1957) independently proposed the same estimator and called it the 
generalized classical linear estimator. 

6. Exact distributions and densities are expressed as twofold or threefold infinite 
series and therefore are difficult to work with. For this reason approximations based on 
various asymptotic expansions have been proposed. They are mainly classified into 
two types: one based on a Taylor expansion of the logarithm of the characteristic 
function, as in Anderson and Sawa (1973), and one that is a direct expansion of the 
probability, as in Anderson (1974). 

7. An instrumental variables estimator satisfying conditions (i) and (ii) but not (iii) 
is consistent and could be asymptotically more efficient than 2SLS. 


8. Nonlinear Simultaneous Equations Models 


1. A slightly more general model defined by $f(y_t, Y_t, X_{1t}, \alpha_0) = u_t$ will be 
considered in Section 8.1.2. 

2. Tsurumi (1970) estimated a CES production function by first linearizing the 
function around certain initial estimates of the parameters and then proceeding as if he 
had the model nonlinear only in variables. Thus his method was in effect a Gauss- 
Newton iteration for obtaining NL2S. However, Tsurumi did not discuss the statistical 
properties of the estimator he used. Applications of models nonlinear only in variables 
can be found in Strickland and Weiss (1976) and Rice and Smith (1977). 

3. The subscript 0 indicating the true value is henceforth suppressed to simplify the 
notation. 

4. Goldfeld and Quandt (1968) considered a two-equation model nonlinear only in 
variables that yields two solutions of dependent variables for a given vector value of the 
error terms. For this model they conducted a Monte Carlo study of how the perform- 
ance of various estimators is affected by different mechanisms of choosing a unique 
solution. 

5. This lemma, stated slightly differently, is originally attributed to Stein (1973). 
Amemiya (1977a) independently rediscovered the lemma. The proof can be found in 
an article by Amemiya (1982b). 


9. Qualitative Response Models 


1. If $\lim_{\Delta \to 0} [g_n(x + \Delta) - g_n(x)]/\Delta = dg_n/dx$ uniformly in n, 
$d(\lim_{n \to \infty} g_n)/dx = \lim_{n \to \infty} dg_n/dx$. See, for example, Apostol (1974, p. 221) for a proof. 

2. The asymptotic efficiency of Berkson's MIN $\chi^2$ estimator can also be proved as a 
corollary of the Barankin and Gurland theorem quoted in Section 4.2.4. See Taylor 
(1953). 
3. There is a slight error in the formula for the mean squared error of MLE, which 
has been corrected by Amemiya (1984b). 

4. See Anderson (1958, Ch. 6) for a thorough discussion of discriminant analysis. 

5. Certain areas of statistics are more amenable to Bayesian analysis than to classical 
analysis. The simultaneous determination of prediction and estimation is one of them. 

6. To simplify the notation we shall omit the subscript 0 that denotes the true value. 
The reader should be able to understand from the context whether a symbol denotes 
the true value of a parameter or the domain of the parameter space. 

7. The possible inconsistency due to this approximation was briefly discussed at the 
end of Section 9.2.5. 

8. Deacon and Shapiro actually used (9.3.32) and the equation obtained by sum- 
ming (9.3.32) and (9.3.33). The resulting estimates of f, and £, are the same as those 
obtained by the method described in the text. 

9. The triple integral in (9.3.72) can be reduced to a double integral by a certain 
transformation. In general, we must evaluate m-tuple integrals for m + 1 responses. 

10. We assume that a component of the vector x is either discrete or continuous, so 
that f is actually the density and the probability combined (density with respect to 
some measure). Thus the integration with respect to x that will appear later should be 
regarded as the integral and summation combined (the Lebesgue-Stieltjes integral with 
respect to the appropriate measure). 

11. Because g(x;) is known, it can be ignored in the maximization of L,. However, 
we retained it to remind the reader of the sampling scheme. In (9.5.2), H( jj) is retained 
for the same reason. 

12. The main difference between the two proofs is that we define f, as a solution of 
(9.5.6) whereas Manski and Lerman define it as the value of f that attains the global 
maximum of (9.5.5) over a compact parameter space containing the true value. 

13. The following analysis can be modified to allow for the possibility that for j and 
x, $P(j|x, \beta) = 0$ for all β. Such a case arises when certain alternatives are unavailable for 
a certain individual. See Manski and McFadden (1983, footnote 23, p. 13). 

14. Manski and Lerman quote an interesting result attributed to McFadden, which 
states that in a multinomial logit model with alternative-specific intercepts (that is, 
the model in which α in (9.3.43) varies with j as in (9.2.4) and (9.2.5)), the inconsistency 
is confined to the parameters $\{\alpha_j\}$. 

15. In the simple logit model considered after (9.5.22), it can be shown that the 
asymptotic variances of MME and WMLE are identical, so that Table 9.7 applies to 
MME as well. 

16. Cosslett's sampling scheme can be generalized further to yield a general strati- 
fied sampling (see Manski and McFadden, 1983, p. 28). 

17. This interpretation does not contradict the fact that in actual decisions the 
determination of x precedes that of j. 

18. There is a subtle difference. In the simple choice-based sampling defined earlier, 
a person choosing alternative j is sampled with probability H(j), whereas in the 
generalized choice-based sampling the number of people in the subsample s is fixed and not 
random. 

19. It is interesting to note that maximizing Q with respect to J and 4, is equivalent 
to maximizing V defined in (9.5.39) with respect to f and Qq(j). Manski and McFad- 
den (1983, p. 24) suggested this version of MME without realizing its equivalence to 
CBMLE. 

20. Cosslett defined WMLE and MME using the actual proportions of people 
choosing j instead of H( j). 

21. Manski's proof is not complete because in the fourth line from the bottom on 
page 218 of his article, double limit operations are interchanged without verifying the 
necessary conditions. It seems that we would have to make more assumptions than 
made by Manski in order for the necessary conditions to hold. A correct proof for the 
binary case can be found in Manski (1985). 

22. Manski (1975) considered a more general score function than that defined here. 

23. The heterogeneity problem is also known as the mover-stayer problem in the 
literature of Markov chain models. Among the first to discuss the problem were 
Blumen, Kogan, and McCarthy (1955), who found that individuals who changed 
occupations most frequently in the past were more likely to change in the future. 

24. The (u,) in (9.7.5) are not i.i.d., unlike the (u,) in (9.7.2). 


10. Tobit Models 


1. More precisely, £ means in this particular case that both sides of the equation 
multiplied by Vm have the same limit distribution. 

2. $\lambda(\cdot)$ is known as the hazard rate and its reciprocal is known as Mills' ratio. Tobin 
(1958) has given a figure that shows that $\lambda(z)$ can be closely approximated by a linear 
function of z for −1 < z < 5. Johnson and Kotz (1970a, p. 278) gave various 
expansions of Mills' ratio. 

3. Chung and Goldberger (1984) generalized the results of Goldberger (1981) and 
Greene (1981) to the case where y* and x are not necessarily jointly normal but E(x|y*) 
is linear in y*. 

4. This was suggested by Wales and Woodland (1980). 

5. See Stapleton and Young (1984). 

6. See Section 4.3.5. Hartley (1976b) proved the asymptotic normality of jy and 
Jxw and that they are asymptotically not as efficient as the MLE. 

7. The asymptotic equivalence of fn and ¥ was proved by Stapleton and Young 
(1984). 

8. Amemiya (1973c) showed that the Tobit likelihood function is not globally 
concave with respect to the original parameters β and $\sigma^2$. 

9. For an alternative account, see Hartley (1976c). 

10. An inequality constraint like this is often necessary in simultaneous equations 
models involving binary or truncated variables. For an interesting unified approach to 
this problem, see Gourieroux, Laffont, and Monfort (1980). 
11. See Cragg (1971) for models that ensure the nonnegativity of $y_2$ as well as of $y_1$. 

12. For a more elaborate derivation of the reservation wage model based on search 
theory, see Gronau (1974). 

13. Gronau specified that the independent variables in the $W^r$ equation include a 
woman's age and education, family income, number of children, and her husband's 
age and education, whereas the independent variables in the $W^o$ equation include only 
a woman's age and education. However, Gronau readily admitted to the arbitrariness 
of the specification and the possibility that all the variables are included in both. 

14. Gronau assumed the independence between u, and v and used an estimator 
different from those mentioned here. Amemiya (1984a) pointed out an error in 
Gronau's procedure. The independence between u, and v is unnecessary if we use 
either the MLE or Heckman's two-step estimator. 

15. Although Heckman's model (1974) is a simultaneous equations model, the 
two-step estimator of Heckman studied by Wales and Woodland is essentially a re- 
duced-form estimator, which we have discussed in this subsection, rather than the 
structural equations version we shall discuss in the next subsection. 

16. For a panel data generalization of Heckman's model, see Heckman and Ma- 
Curdy (1980). 

17. Actually, Heckman used log W" and log W^. The independent variables x, 
include husband's wage, asset income, prices, and individual characteristics and z 
includes housewife's schooling and experience. 

18. A more explicit expression for the likelihood function was obtained by Ame- 
miya (1974a), who pointed out the incorrectness of the likelihood function originally 
given by Fair and Jaffee. 

19. Equation (10.10.26) is the maximized profit function and (10.10.27) is an input 
demand or output supply function obtained by differentiating (10.10.26) with respect 
to the own input or output price (Hotelling's lemma). For convenience only one input 
or output has been assumed; so, strictly speaking, $x^{(1)}$ and $x^{(2)}$ are scalars. 

20. These two equations correspond to the two equations in a proposition of Dun- 
can (1980, p. 851). It seems that Duncan inadvertently omitted the last term from 
(10.10.30). 


11. Markov Chain and Duration Models 


1. Note that this result can be obtained from (11.1.38) using the Boskin-Nold 
assumptions as well as the assumption that T goes to ∞. The result (11.1.38) is valid 
even if T does not go to ∞ provided that NT goes to ∞. If we assumed p(0) = p(∞), the 
approximate equality in (11.1.55) would be exact and hence (11.1.59) would be valid 
without assuming T → ∞ provided that NT → ∞. 

2. Actually, Toikka is interested only in three out of the six transition probabilities, 
and he lets those three depend linearly on exogenous variables. However, for the 
simplicity of analysis, we shall proceed in our discussion as if all the six transition 
probabilities depended linearly on exogenous variables. 
3. Note that minimizing (11.1.78) is equivalent to maximizing the pseudo likelihood
function under the assumption that r, is normally distributed.

4. The likelihood function (11.2.14) pertains only to unemployment spells. When 
we also observe employment spells, (11.2.14) should be multiplied by a similar expression
for employment spells. When there is no common unknown parameter in both
parts of the likelihood function, each part can be separately maximized. 
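
The separability claim in note 4 can be written out in a line or two. This is only a sketch under assumed notation: the symbols L_u and L_e (the unemployment-spell and employment-spell parts of the likelihood) and θ_u and θ_e (their respective parameter vectors) are introduced here for illustration and do not appear in the text.

```latex
% Assumed notation: L_u is the unemployment-spell part (11.2.14),
% L_e the analogous employment-spell part; \theta_u and \theta_e
% share no common element.
\begin{align*}
L(\theta_u, \theta_e) &= L_u(\theta_u)\, L_e(\theta_e), \\
\log L(\theta_u, \theta_e) &= \log L_u(\theta_u) + \log L_e(\theta_e), \\
\max_{\theta_u,\,\theta_e} \log L(\theta_u, \theta_e)
  &= \max_{\theta_u} \log L_u(\theta_u) + \max_{\theta_e} \log L_e(\theta_e).
\end{align*}
```

The last line holds only because no parameter enters both factors; with a shared parameter the two parts would have to be maximized jointly.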

5. See Gradshteyn and Ryzhik (1965, p. 576). 

6. See Bellman (1970, Chapter 10). 

7. Here we have simplified Tuma's model slightly to bring out the essential points. 

8. To integrate out v, we must specify its distribution as Lancaster did. A wrong 
distribution would introduce a specification bias. Heckman and Singer (1982, 1984a) 
discussed an interesting procedure of treating the distribution of v as unknown and 
maximizing the likelihood function with respect to that distribution as well as the other
unknown parameters of the model. 

9. See Kalbfleisch and Prentice (1980) for a procedure for modifying the following 
analysis in the case of ties. 

10. Let L(y, θ) be a joint density function (or a likelihood function if it is considered
as a function of θ) of a vector of random variables y. If y is transformed into (v, w) by a
transformation not dependent on θ, the conditional density L(v|w, θ) is called a conditional
likelihood function.
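
The definition in note 10 corresponds to the following factorization. It is a sketch that assumes, beyond what the note states, that the transformation from y to (v, w) is one-to-one and differentiable; the Jacobian symbol J is introduced here for illustration.

```latex
% Assumed: y -> (v, w) is one-to-one with Jacobian J free of \theta.
\begin{equation*}
L(y, \theta) \;=\; |J| \, L(v, w, \theta)
            \;=\; |J| \, L(v \mid w, \theta) \, L(w, \theta).
\end{equation*}
% The conditional likelihood keeps only the factor L(v | w, \theta).
```

Because |J| does not involve θ, it plays no role in maximization; using the conditional likelihood amounts to discarding the marginal factor L(w, θ).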


References 


Abramowitz, M., and I. A. Stegun. 1965. Handbook of Mathematical Functions. New
York: Dover Publishing. 

Adams, J. D. 1980. “Personal Wealth Transfers." Quarterly Journal of Economics 
95:159-179. 

Adelman, I. G. 1958. “A Stochastic Analysis of the Size Distribution of Firms.” 
Journal of the American Statistical Association 53:893-904.

Aigner, D. J., T. Amemiya, and D. J. Poirier. 1976. “On the Estimation of Production 
Frontiers: Maximum Likelihood Estimation of the Parameters of a Discontinuous
Density Function.” International Economic Review 17:377-396.

Aigner, D. J., and G. G. Judge. 1977. “Application of Pre-Test and Stein Estimators to 
Economic Data.” Econometrica 45:1279-1288.

Aitchison, J., and J. Bennett. 1970. “Polychotomous Quantal Response by Maximum 
Indicant.” Biometrika 57:253-262.

Akahira, M. 1983. “Asymptotic Deficiency of the Jackknife Estimator.” The Austra- 
lian Journal of Statistics 25:123-129. 

Akaike, H. 1973. “Information Theory and an Extension of the Maximum Likelihood 
Principle,” in B. N. Petrov and F. Csaki, eds., Second International Symposium 
on Information Theory, pp. 267-281. Budapest: Akademiai Kiado. 

1976. “On Entropy Maximization Principle.” Paper presented at the Sympo- 
sium on Applications of Statistics, Dayton, Ohio. 

Albert, A., and J. A. Anderson. 1984. “On the Existence of Maximum Likelihood 
Estimates in Logistic Regression Models.” Biometrika 71:1-10. 

Albright, R. L., S. R. Lerman, and C. F. Manski. 1977. "Report on the Development of 
an Estimation Program for the Multinomial Probit Model." Mimeographed 
paper prepared for the Federal Highway Administration. 

Almon, S. 1965. “The Distributed Lag between Capital Appropriations and Expenditures.”
Econometrica 33:178-196.

Amemiya, T. 1966. “On the Use of Principal Components of Independent Variables 
in Two-Stage Least-Squares Estimation.” International Economic Review
7:283-303.

1967. “A Note on the Estimation of Balestra-Nerlove Models.” Technical
Report no. 4, Institute for Mathematical Studies in the Social Sciences, Stanford
University, Calif.




1971. “The Estimation of the Variances in a Variance-Components Model.” 
International Economic Review 12:1-13. 

1972. “Bivariate Probit Analysis: Minimum Chi-Square Methods.” Journal of 
the American Statistical Association 69:940-944.

1973a. “Generalized Least Squares with an Estimated Autocovariance Matrix.”
Econometrica 41:723-732.

1973b. “Regression Analysis When the Variance of the Dependent Variable Is
Proportional to the Square of Its Expectation.” Journal of the American Statistical
Association 68:928-934.

1973c. “Regression Analysis When the Dependent Variable Is Truncated
Normal.” Econometrica 41:997-1016.

1974a. “A Note on a Fair and Jaffee Model.” Econometrica 42:759-762.

1974b. “Multivariate Regression and Simultaneous Equation Models When
the Dependent Variables Are Truncated Normal.” Econometrica 42:999-1012.

1974c. “The Nonlinear Two-Stage Least-Squares Estimator.” Journal of Econometrics
2:105-110.

1975a. “The Nonlinear Limited-Information Maximum-Likelihood Estima- 
tor and the Modified Nonlinear Two-Stage Least-Squares Estimator.” Journal of 
Econometrics 3:375-386.

1975b. “Qualitative Response Models.” Annals of Economic and Social Measurement
4:363-372.

1976a. “Estimation in Nonlinear Simultaneous Equation Models.” Paper
presented at and published by Institut National de la Statistique et des Etudes
Economiques, Paris. Published in French: E. Malinvaud, ed., Cahiers du séminaire
d'econometrie, no. 19, 1978.

1976b. “The Maximum Likelihood, the Minimum Chi-Square and the Nonlinear
Weighted Least-Squares in the General Qualitative Response Model.”
Journal of the American Statistical Association 71:347-351.

1977a. “The Maximum Likelihood and the Nonlinear Three-Stage Least 
Squares Estimator in the General Nonlinear Simultaneous Equation Model.” 
Econometrica 45:955-968.

1977b. “A Note on a Heteroscedastic Model.” Journal of Econometrics
6:365-370.

1977c. “The Modified Second-Round Estimator in the General Qualitative 
Response Model.” Journal of Econometrics 5:295-299.

1978a. “Corrigenda: A Note on a Heteroscedastic Model.” Journal of Econometrics
8:265.

1978b. “A Note on a Random Coefficients Model.” International Economic
Review 19:793-796.

1978c. “The Estimation of a Simultaneous Equation Generalized Probit
Model.” Econometrica 46:1193-1205.




1979. “The Estimation of a Simultaneous-Equation Tobit Model.” International
Economic Review 20:169-181.

1980a. “Selection of Regressors.” International Economic Review 21:331-345.

1980b. “The n⁻²-order Mean Squared Errors of the Maximum Likelihood and
the Minimum Logit Chi-Square Estimator.” Annals of Statistics 8:488-505.

1981. “Qualitative Response Models: A Survey.” Journal of Economic Literature
19:1483-1536.

1982a. “Two Stage Least Absolute Deviations Estimators.” Econometrica 
50:689-711. 

———. 1982b. “Correction to a Lemma.” Econometrica 50:1325-1328.

1983a. “Non-Linear Regression Models,” in Z. Griliches and M. D. Intrilligator,
eds., Handbook of Econometrics, 1:333-389. Amsterdam: North-Holland
Publishing.

1983b. “A Comparison of the Amemiya GLS and the Lee-Maddala-Trost
G2SLS in a Simultaneous-Equations Tobit Model.” Journal of Econometrics
23:295-300.

1983c. “Partially Generalized Least Squares and Two-Stage Least Squares
Estimators.” Journal of Econometrics 23:275-283.

1984a. “Tobit Models: A Survey.” Journal of Econometrics 24:3-61.

1984b. “Correction.” Annals of Statistics 12:783. 

Amemiya, T., and M. Boskin. 1974. “Regression Analysis When the Dependent Vari- 
able Is Truncated Lognormal, with an Application to the Determinants of the 
Duration of Welfare Dependency.” International Economic Review 15:485-496.

Amemiya, T., and W. A. Fuller. 1967. “A Comparative Study of Alternative Estimators
in a Distributed-Lag Model.” Econometrica 35:509-529.

Amemiya, T., and T. E. MaCurdy. 1983. "Instrumental Variable Estimation of an 
Error Components Model." Technical Report no. 414, Institute for Mathemati- 
cal Studies in the Social Sciences, Stanford University, Calif. (Forthcoming in 
Econometrica.) 

Amemiya, T., and K. Morimune. 1974. “Selecting the Optimal Order of Polynomial 
in the Almon Distributed Lag.” Review of Economics and Statistics 56:378-386.

Amemiya, T., and J. L. Powell. 1981. "A Comparison of the Box-Cox Maximum 
Likelihood Estimator and the Non-Linear Two-Stage Least Squares Estimator." 
Journal of Econometrics 17:351-381.

1983. “A Comparison of the Logit Model and Normal Discriminant Analysis
When the Independent Variables Are Binary," in S. Karlin, T. Amemiya, and 
L. A. Goodman, eds., Studies in Econometric, Time Series, and Multivariate 
Statistics, pp. 3-30. New York: Academic Press.

Anderson, T. W. 1958. Introduction to Multivariate Statistical Analysis. New York: 
John Wiley & Sons. 




———. 1969. “Statistical Inference for Covariance Matrices with Linear Structure,” in
P. R. Krishnaiah, ed., Proceedings of the Second International Symposium on 
Multivariate Analysis, pp. 55-66. New York: Academic Press. 

1971. The Statistical Analysis of Time Series. New York: John Wiley and Sons.

1974. “An Asymptotic Expansion of the Distribution of the Limited Information
Maximum Likelihood Estimate of a Coefficient in a Simultaneous Equation
System.” Journal of the American Statistical Association 60:565-573.

1982. “Some Recent Developments of the Distributions of Single-Equation 
Estimators,” in W. Hildenbrand, ed., Advances in Econometrics, pp. 109-122.
Cambridge: Cambridge University Press.

Anderson, T. W., and L. A. Goodman. 1957. “Statistical Inference about Markov 
Chains.” Annals of Mathematical Statistics 28:89-110.

Anderson, T. W., and C. Hsiao. 1981. “Estimation of Dynamic Models with Error
Components.” Journal of the American Statistical Association 76:598-606.

1982. “Formulation and Estimation of Dynamic Models Using Panel Data.”
Journal of Econometrics 18:47-82.

Anderson, T. W., and H. Rubin. 1949. “Estimator of the Parameters of a Single 
Equation in a Complete System of Stochastic Equations.” Annals of Mathematical
Statistics 20:46-63.

Anderson, T. W., and T. Sawa. 1973. “Distributions of Estimates of Coefficients of a 
Single Equation in a Simultaneous System and Their Asymptotic Expansions.” 
Econometrica 41:683-714.

Andrews, D. F. 1974. “A Robust Method for Multiple Linear Regression.” Technometrics
16:523-531.

Andrews, D. F., P. J. Bickel, F. R. Hampel, P. J. Huber, W. H. Rogers, and J. W. 
Tukey. 1972. Robust Estimates of Location. Princeton: Princeton University 
Press. 

Anscombe, F. J. 1961. “Examination of Residuals,” in J. Neyman, ed., Proceedings
of the Fourth Berkeley Symposium on Mathematical Statistics and Probability,
1:1-36. Berkeley: University of California Press.

Apostol, T. M. 1974. Mathematical Analysis, 2nd ed. Reading, Mass.: Addison- 
Wesley. 

Arabmazar, A., and P. Schmidt. 1981. “Further Evidence on the Robustness of the 
Tobit Estimator to Heteroscedasticity.” Journal of Econometrics 17:253-258. 

1982. “An Investigation of the Robustness of the Tobit Estimator to Non-Normality.”
Econometrica 50:1055-1063.

Arrow, K. J., H. B. Chenery, B. S. Minhas, and R. M. Solow. 1961. “Capital-Labor
Substitution and Economic Efficiency.” Review of Economics and Statistics
43:225-250.

Ashenfelter, O., and J. Ham. 1979. “Education, Unemployment, and Earnings.”
Journal of Political Economy 87:S99-S116.




Ashford, J. R., and R. R. Sowden. 1970. “Multivariate Probit Analysis.” Biometrics 
26:535-546. 

Baldwin, R. E. 1971. “Determinants of the Commodity Structure of U.S. Trade.” 
American Economic Review 61:126-146.

Balestra, P., and M. Nerlove. 1966. “Pooling Cross-Section and Time Series Data in 
the Estimation of a Dynamic Model: The Demand for Natural Gas.” Econometrica
34:585-612.

Baltagi, B. H. 1981. “Pooling: An Experimental Study of Alternative Testing and 
Estimation Procedures in a Two-Way Error Component Model.” Journal of
Econometrics 17:21-49.

Baranchik, A. J. 1970. “A Family of Minimax Estimators of the Mean of a Multivariate
Normal Distribution.” Annals of Mathematical Statistics 41:642-645.

Barankin, E. W., and J. Gurland. 1951. “On Asymptotically Normal Efficient Estimators:
I.” University of California Publications in Statistics 1:86-130.

Bartholomew, D. J. 1982. Stochastic Models for Social Processes, 3rd ed. New York: 
John Wiley & Sons. 

Basmann, R. L. 1957. “A Generalized Classical Method of Linear Estimation of 
Coefficients in a Structural Equation.” Econometrica 25:77-83.

Bassett, G., Jr., and R. Koenker. 1978. “Asymptotic Theory of Least Absolute Error 
Regression.” Journal of the American Statistical Association 73:618-622.

Beach, C. M., and J. G. MacKinnon. 1978. “A Maximum Likelihood Procedure for 
Regression with Auto-correlated Errors.” Econometrica 46:51-58.

Bellman, R. 1970. Introduction to Matrix Analysis, 2d ed. New York: McGraw-Hill. 

Belsley, D. A. 1979. “On the Computational Competitiveness of Full-Information 
Maximum-Likelihood and Three-Stage Least-Squares in the Estimation of Nonlinear
Simultaneous-Equations Models.” Journal of Econometrics 9:315-342.

Benus, J., J. Kmenta, and H. Shapiro. 1976. “The Dynamics of Household Budget 
Allocation to Food Expenditures.” Review of Economics and Statistics 58:129-138.

Bera, A. K., C. M. Jarque, and L. F. Lee. 1982. “Testing for the Normality Assumption 
in Limited Dependent Variable Models." Mimeographed paper, Department of 
Economics, University of Minnesota. 

Berger, J. O. 1975. “Minimax Estimation of Location Vectors for a Wide Class of 
Densities.” Annals of Statistics 3:1318-1328.

1976. “Admissible Minimax Estimation of a Multivariate Normal Mean with 
Arbitrary Quadratic Loss." Annals of Statistics 4:223-226. 

Berkson, J. 1944. “Application of the Logistic Function to Bio-Assay.” Journal of the 
American Statistical Association 39:357-365.

1955. “Maximum Likelihood and Minimum χ² Estimates of the Logistic
Function.” Journal of the American Statistical Association 50:130-162.

1957. “Tables for Use in Estimating the Normal Distribution Function by
Normit Analysis.” Biometrika 44:411-435.




1980. “Minimum Chi-Square, Not Maximum Likelihood!” Annals of Statistics
8:457-487.

Berndt, E. R., B. H. Hall, R. E. Hall, and J. A. Hausman. 1974. “Estimation and 
Inference in Nonlinear Structural Models.” Annals of Economic and Social Measurement
3:653-666.

Berndt, E. R., and N. E. Savin. 1977. “Conflict among Criteria for Testing Hypotheses 
in the Multivariate Linear Regression Model.” Econometrica 45:1263-1278.

Berzeg, K. 1979. “The Error Components Model: Conditions for the Existence of the 
Maximum Likelihood Estimates.” Journal of Econometrics 10:99-102.

Bhattacharya, P. K. 1966. “Estimating the Mean of a Multivariate Normal Population
with General Quadratic Loss Function.” Annals of Mathematical Statistics
32:1819-1824.

Bhattacharya, R. N., and R. R. Rao. 1976. Normal Approximation and Asymptotic
Expansions. New York: John Wiley & Sons. 

Bianchi, C., and G. Calzolari. 1980. “The One-Period Forecast Error in Nonlinear 
Econometric Models.” International Economic Review 21:201-208.

Bickel, P. J. 1975. “One-Step Huber Estimation in the Linear Model." Journal of the 
American Statistical Association 70:428-433.

1978. “Using Residuals Robustly, I: Tests for Heteroscedasticity, Nonlinearity
and Nonadditivity.” Annals of Statistics 6:266-291.

Bickel, P. J., and K. A. Doksum. 1977. Mathematical Statistics: Basic Ideas and 
Selected Topics. San Francisco: Holden-Day. 

Bishop, Y. M. M., S. E. Fienberg, and P. W. Holland. 1975. Discrete Multivariate
Analysis, Theory and Practice. Cambridge, Mass.: MIT Press. 

Blattberg, R., and T. Sargent. 1971. “Regression with Non-Gaussian Stable Disturbances:
Some Sampling Results.” Econometrica 39:501-510.

Blumen, I., M. Kogan, and P. J. McCarthy. 1955. The Industrial Mobility of Labor as 
a Probability Process. Ithaca, N.Y.: Cornell University Press. 

Bock, M. E. 1975. “Minimax Estimators of the Mean of a Multivariate Distribution." 
Annals of Statistics 3:209-218.

Bodkin, R. G., and L. R. Klein. 1967. "Nonlinear Estimation of Aggregate Production 
Functions.” Review of Economics and Statistics 49:28-44.

Borjas, G. J., and S. Rosen. 1980. "Income Prospects and Job Mobility of Young 
Men.” Research in Labor Economics 3:159-181.

Boskin, M. J., and F. C. Nold. 1975. “A Markov Model of Turnover in Aid to Families 
with Dependent Children.” Journal of Human Resources 10:476-481.

Box, G. E. P., and D. R. Cox. 1964. "An Analysis of Transformations.” Journal of the 
Royal Statistical Society ser. B, 26:211-252 (with discussion).

Box, G. E. P., and G. M. Jenkins. 1976. Time Series Analysis: Forecasting and Con- 
trol, rev. ed. San Francisco: Holden-Day. 

Box, G. E. P., and G. C. Tiao. 1973. Bayesian Inference in Statistical Analysis. Read- 
ing, Mass.: Addison-Wesley. 




Breusch, T. S. 1978. “Testing for Autocorrelation in Dynamic Linear Models.” Australian
Economic Papers 17:334-335.

Breusch, T. S., and A. R. Pagan. 1979. “A Simple Test for Heteroscedasticity and 
Random Coefficient Variation.” Econometrica 47:1287-1294.

1980. “The Lagrange Multiplier Test and Its Applications to Model Specifica- 
tion in Econometrics.” Review of Economic Studies 47:239-253. 

Brillinger, D. R. 1975. Time Series: Data Analysis and Theory. New York: Holt, 
Rinehart, and Winston. 

Brook, R. J. 1976. “On the Use of a Regret Function to Set Significance Points in Prior 
Tests of Estimation.” Journal of the American Statistical Association 71:126-131.

Brown, B. W. 1983. “The Identification Problem in Systems Nonlinear in the Vari- 
ables.” Econometrica 51:175-196. 

Brown, B. W., and R. S. Mariano. 1982. "Residual-Based Stochastic Prediction in a 
Nonlinear Simultaneous System.” Analysis Center, The Wharton School, Uni- 
versity of Pennsylvania. 

Brown, M., and D. Heien. 1972. “The S-Branch Utility Tree: A Generalization of the 
Linear Expenditure System.” Econometrica 40:737-747.

Brown, P., and C. Payne. 1975. “Election Night Forecasting.” Journal of the Royal 
Statistical Society ser. A, 138:463-498 (with discussion). 

Brown, R. L., J. Durbin, and J. M. Evans. 1975. “Techniques for Testing the Constancy
of Regression Relationships over Time.” Journal of the Royal Statistical
Society ser. B, 37:149-192 (with discussion).

Brown, W. G., and B. R. Beattie. 1975. “Improving Estimates of Economic Parameters
by Use of Ridge Regression with Production Function Applications.” American
Journal of Agricultural Economics 57:21-32.

Brundy, J. M., and D. W. Jorgenson. 1974. "Consistent and Efficient Estimation of 
Systems of Simultaneous Equations by Means of Instrumental Variables," in P. 
Zarembka, ed., Frontiers in Econometrics, pp. 215-244. New York: Academic 
Press. 

Buse, A. 1982. “Tests for Additive Heteroscedasticity: Some Monte Carlo Results." 
Research Paper no. 82-13, Department of Economics, University of Alberta. 

Butler, J. S., and R. Moffitt. 1982. “A Computationally Efficient Quadrature Procedure
for the One-Factor Multinomial Probit Model.” Econometrica 50:761-764.

Carroll, R. J., and D. Ruppert. 1982a. "Robust Estimation in Heteroscedastic Linear 
Models." Annals of Statistics 10:429-441. 

———. 1982b. “A Comparison between Maximum Likelihood and Generalized Least
Squares in a Heteroscedastic Linear Model.” Journal of the American Statistical
Association 77:878-882.

Chamberlain, G. 1982. “Multivariate Regression Models for Panel Data." Journal of 
Econometrics 18:5-46. 

Chamberlain, G., and Z. Griliches. 1975. “Unobservables with a Variance-Components
Structure: Ability, Schooling and the Economic Success of Brothers.” International
Economic Review 16:422-429.

Champernowne, D. G. 1953. “A Model of Income Distribution.” Economic Journal 
63:318-351.

Charatsis, E. G. 1971. “A Computer Program for Estimation of the Constant Elasticity 
of Substitution Production Function.” Applied Statistics 20:286-296.

Chow, G. C. 1960. “Tests for Equality between Sets of Coefficients in Two Linear 
Regressions.” Econometrica 28:591-605.

1968. “Two Methods of Computing Full-Information Maximum Likelihood
Estimates in Simultaneous Stochastic Equations.” International Economic Review
9:100-112.

1973. “On the Computation of Full-Information Maximum Likelihood Estimates
for Nonlinear Equation Systems.” Review of Economics and Statistics
55:104-109.

Chow, G. C., and R. C. Fair. 1973. "Maximum Likelihood Estimation of Linear 
Equation Systems with Auto-Regressive Residuals.” Annals of Economic and 
Social Measurement 2:17-28. 

Christ, C. F. 1966. Econometric Models and Methods. New York: John Wiley & Sons. 

Christensen, L. R., D. W. Jorgenson, and L. J. Lau. 1975. “Transcendental Logarithmic
Utility Functions.” American Economic Review 65:367-383.

Chung, C. F., and A. S. Goldberger. 1984. “Proportional Projections in Limited 
Dependent Variable Models.” Econometrica 52:531-534.

Chung, K. L. 1974. A Course in Probability Theory, 2d ed. New York: Academic Press.

Clark, C. 1961. “The Greatest of a Finite Set of Random Variables.” Operations 
Research 9:145-162.

Cochrane, D., and G. H. Orcutt. 1949. "Application of Least Squares Regression to 
Relationships Containing Autocorrelated Error Terms." Journal of the American 
Statistical Association 44:32-61.

Cooley, T. F., and E. C. Prescott. 1976. “Estimation in the Presence of Stochastic 
Parameter Variation.” Econometrica 44:167-184.

Cooper, J. P. 1972. “Two Approaches to Polynomial Distributed Lags Estimation: An 
Expository Note and Comment." The American Statistician 26:32-35. 

Cosslett, S. R. 1978. “Efficient Estimation of Discrete-Choice Models from Choice- 
Based Samples." Workshop in Transportation Economics, University of Califor- 
nia, Berkeley. 

1981a. “Maximum Likelihood Estimator for Choice-Based Samples.” Econometrica
49:1289-1316.

1981b. “Efficient Estimation of Discrete-Choice Models,” in C. F. Manski and
D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Applications,
pp. 51-111. Cambridge, Mass.: MIT Press.

1983. “Distribution-Free Maximum Likelihood Estimator of the Binary
Choice Model.” Econometrica 51:765-782.




Cox, D. R. 1961. “Tests of Separate Families of Hypotheses,” in J. Neyman, ed., 
Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and 
Probability, 1:105-123. Berkeley: University of California Press.

1962. “Further Results on Tests of Separate Families of Hypotheses.” Journal
of the Royal Statistical Society ser. B, 24:406-424.

1966. “Some Procedures Connected with the Logistic Qualitative Response
Curve,” in F. N. David, ed., Research Papers in Statistics, pp. 55-71. New York:
John Wiley & Sons.

1970. Analysis of Binary Data. London: Methuen.

1972. “Regression Models and Life Tables.” Journal of the Royal Statistical
Society ser. B, 34:187-220 (with discussion).

1975. “Partial Likelihood.” Biometrika 62:269-276.

Cox, D. R., and D. V. Hinkley. 1974. Theoretical Statistics. London: Chapman and 
Hall. 

Cragg, J. G. 1971. "Some Statistical Models for Limited Dependent Variables with 
Application to the Demand for Durable Goods.” Econometrica 39:829-844.

Cramér, H. 1946. Mathematical Methods of Statistics. Princeton: Princeton Univer- 
sity Press. 

Crowder, M. J. 1980. “On the Asymptotic Properties of Least-Squares Estimators in 
Autoregression.” Annals of Statistics 8:132-146.

Dagenais, M. G. 1978. “The Computation of FIML Estimates as Iterative Generalized 
Least Squares Estimates in Linear and Nonlinear Simultaneous Equations 
Models.” Econometrica 46:1351-1362.

David, J. M., and W. E. Legg. 1975. “An Application of Multivariate Probit Analysis 
to the Demand for Housing: A Contribution to the Improvement of the Predictive 
Performance of Demand Theory, Preliminary Results.” American Statistical As- 
sociation Proceedings of the Business and Economics and Statistics Section, pp. 
295-300. 

Davidson, W. C. 1959. “Variable Metric Method for Minimization.” Atomic Energy
Commission, Research Development Report ANL-5990, Washington, D.C.

Davis, L. 1984. “Comments on a Paper by T. Amemiya on Estimation in a Dichotomous
Logit Regression Model.” Annals of Statistics 12:778-782.

Deacon, R., and P. Shapiro. 1975. “Private Preference for Collective Goods Revealed
through Voting and Referenda.” American Economic Review 65:943-955.

Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from
Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society
ser. B, 39:1-38 (with discussion).

Dempster, A. P., M. Schatzoff, and N. Wermuth. 1977. “A Simulation Study of 
Alternatives to Ordinary Least Squares." Journal of the American Statistical 
Association 72:77-111 (with discussion).

Dent, W. T., and C. Hildreth. 1977. “Maximum Likelihood Estimation in Random
Coefficient Models.” Journal of the American Statistical Association 72:69-72.




Dhrymes, P. J. 1971. Distributed Lags: Problems of Estimation and Formulation. San 
Francisco: Holden-Day. 

Diewert, W. E. 1974. “Applications of Duality Theory,” in M. D. Intrilligator and 
D. A. Kendrick, eds., Frontiers of Quantitative Economics, 2:106-171. Amsterdam:
North-Holland Publishing.

Domencich, T. A., and D. McFadden. 1975. Urban Travel Demand. Amsterdam: 
North-Holland Publishing. 

Doob, J. L. 1953. Stochastic Processes. New York: John Wiley & Sons. 

Draper, N. R., and D. R. Cox. 1969. “On Distributions and Their Transformation to 
Normality.” Journal of the Royal Statistical Society ser. B, 31:472-476. 

Draper, N. R., and H. Smith. 1981. Applied Regression Analysis, 2d ed. New York: 
John Wiley & Sons. 

Draper, N. R., and R. C. Van Nostrand. 1979. "Ridge Regression and James-Stein 
Estimation: Review and Comments." Technometrics 21:451-466. 

Dubin, J. A., and D. McFadden. 1984. “An Econometric Analysis of Residential 
Electric Appliance Holdings and Consumption.” Econometrica 52:345-362.

Dudley, L., and C. Montmarquette. 1976. “A Model of the Supply of Bilateral Foreign
Aid.” American Economic Review 66:132-142.

Duncan, G. M. 1980. “Formulation and Statistical Analysis of the Mixed, Continuous/Discrete
Dependent Variable Model in Classical Production Theory.” Econometrica
48:839-852.

Duncan, G. T., and L. G. Lin. 1972. “Inference for Markov Chains Having Stochastic 
Entry and Exit.” Journal of the American Statistical Association 67:761-767.

Durbin, J. 1960. “Estimation of Parameters in Time-Series Regression Models." Jour- 
nal of the Royal Statistical Society ser. B, 22:139-153. 

———. 1963. “Maximum-Likelihood Estimation of the Parameters of a System of
Simultaneous Regression Equations.” Paper presented at the Copenhagen Meet- 
ing of the Econometric Society. 

1970. “Testing for Serial Correlation in Least-Squares Regression When Some 
of the Regressors Are Lagged Dependent.” Econometrica 38:410-421.

Durbin, J., and G. S. Watson. 1950. “Testing for Serial Correlation in Least Squares 
Regression, I.” Biometrika 37:409-428.

1951. “Testing for Serial Correlation in Least Squares Regression, II.” Biometrika
38:159-178.

1971. “Testing for Serial Correlation in Least Squares Regression, III.” Biometrika
58:1-19.

Efron, B. 1975. “The Efficiency of Logistic Regression Compared to Normal Discriminant
Analysis.” Journal of the American Statistical Association 70:892-898.

1977. “The Efficiency of Cox's Likelihood Function for Censored Data.”
Journal of the American Statistical Association 72:557-565.

1982. The Jackknife, the Bootstrap, and Other Resampling Plans. Philadelphia:
Society for Industrial and Applied Mathematics.




Efron, B., and C. Morris. 1972. “Limiting the Risk of Bayes and Empirical Bayes 
Estimators — Part II: The Empirical Bayes Case.” Journal of the American Statistical
Association 67:130-139.

1973. “Stein's Estimation Rule and Its Competitors — An Empirical Bayes
Approach.” Journal of the American Statistical Association 68:117-130.

1975. “Data Analysis Using Stein's Estimator and Its Generalizations.” Journal
of the American Statistical Association 70:311-319.

1976. “Families of Minimax Estimators of the Mean of a Multivariate Normal
Distribution.” Annals of Statistics 4:11-21.

Ehrlich, I. 1977. "Capital Punishment and Deterrence: Some Further Thoughts and 
Additional Evidence.” Journal of Political Economy 85:741-788.

Eicker, F. 1963. “Asymptotic Normality and Consistency of the Least Squares Esti- 
mators for Families of Linear Regressions." Annals of Mathematical Statistics 
34:447-456.

Eisenpress, H., and J. Greenstadt. 1966. “The Estimation of Nonlinear Econometric 
Systems.” Econometrica 34:851-861.

Engle, R. F. 1984. “Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econo- 
metrics," in Z. Griliches and M. D. Intrilligator, eds., Handbook of Econometrics, 
2:775-826. Amsterdam: North-Holland Publishing.

Fair, R. C. 1974. “On the Robust Estimation of Econometric Models." Annals of 
Economic and Social Measurement 3:667-677.

1976. A Model of Macroeconomic Activity. Vol. 2, The Empirical Model.
Cambridge, Mass.: Ballinger.

1978. “A Theory of Extramarital Affairs." Journal of Political Economy 
86:45-61. 

Fair, R. C., and D. M. Jaffee. 1972. “Methods of Estimation for Markets in Disequilibrium.”
Econometrica 40:497-514.

Fair, R. C., and W. R. Parke. 1980. “Full-Information Estimates of a Nonlinear 
Macroeconometric Model.” Journal of Econometrics 13:269-291.

Farebrother, R. W. 1975. “Minimax Regret Significance Points for a Preliminary Test
in Regression Analysis.” Econometrica 43:1005-1006.

1980. “The Durbin-Watson Test for Serial Correlation When There Is No 
Intercept in the Regression.” Econometrica 48:1553-1563.

Feller, W. 1961. An Introduction to Probability Theory and Its Applications, vol. 1, 2d 
ed. New York: John Wiley & Sons. 

Ferguson, T. S. 1958. “A Method of Generating Best Asymptotically Normal Estimates
with Application to the Estimation of Bacterial Densities.” Annals of Mathematical
Statistics 29:1046-1062.

Fisher, F. M. 1966. The Identification Problem in Econometrics. New York: McGraw- 
Hill. 

1970. “A Correspondence Principle for Simultaneous Equation Models.”
Econometrica 38:73-92.




Fletcher, R., and M. J. D. Powell. 1963. “A Rapidly Convergent Descent Method for 
Minimization.” Computer Journal 6:163-168. 

Flinn, C. J., and J. J. Heckman. 1982. “Models for the Analysis of Labor Force 
Dynamics.” Advances in Econometrics 1:35-95. 

Fomby, T. B., R. C. Hill, and S. R. Johnson. 1978. “An Optimal Property of Principal 
Components in the Context of Restricted Least Squares.” Journal of the American
Statistical Association 73:191-193.

Forsythe, A. B. 1972. "Robust Estimation of Straight Line Regression Coefficients by 
Minimizing p-th Power Deviations.” Technometrics 14:159-166.

Freund, J. F. 1971. Mathematical Statistics, 2d ed. Englewood Cliffs, N.J.: Prentice- 
Hall. 

Froehlich, B. R. 1973. “Some Estimates for a Random Coefficient Regression Model.”
Journal of the American Statistical Association 68:329-335.

Fuller, W. A. 1976. Introduction to Statistical Time Series. New York: John Wiley & 
Sons. 

1977. “Some Properties of a Modification of the Limited Information Estimator.”
Econometrica 45:939-953.

Fuller, W. A., and G. E. Battese. 1974. "Estimation of Linear Models with Crossed- 
Error Structure." Journal of Econometrics 2:67-78. 

Gallant, A. R. 1975a. "Nonlinear Regression." The American Statistician 29:73-81. 

1975b. “Testing a Subset of the Parameters of a Nonlinear Regression Model.”
Journal of the American Statistical Association 70:927-932.

1977. “Three-Stage Least-Squares Estimation for a System of Simultaneous, 
Nonlinear, Implicit Equations." Journal of Econometrics 5:71-88.

Gallant, A. R., and A. Holly. 1980. “Statistical Inference in an Implicit, Nonlinear,
Simultaneous Equation Model in the Context of Maximum Likelihood Estima-
tion.” Econometrica 48:697-720.

Gallant, A. R., and D. W. Jorgenson. 1979. "Statistical Inference for a System of 
Simultaneous, Nonlinear, Implicit Equations in the Context of Instrumental 
Variable Estimation." Journal of Econometrics 11:275-302.

Gastwirth, J. L. 1966. “On Robust Procedures.” Journal of the American Statistical
Association 65:946-973.

Gaver, K. M., and M. S. Geisel. 1974. “Discriminating Among Alternative Models: 
Bayesian and Non-Bayesian Methods," in P. Zarembka, ed., Frontiers in Econo- 
metrics. pp. 49-80. New York: Academic Press. 

Ghosh, J. K., and B. K. Sinha. 1981. “A Necessary and Sufficient Condition for 
Second Order Admissibility with Applications to Berkson's Bioassay Problem." 
Annals of Statistics 9:1334-1338.

Ghosh, J. K., and K. Subramanyam. 1974. “Second-Order Efficiency of Maximum 
Likelihood Estimators.” Sankhya ser. A, 36:325-358. 

Gnedenko, B. V., and A. N. Kolmogorov. 1954. Limit Distributions for Sums of 
Independent Random Variables. Reading, Mass.: Addison-Wesley. 




Godfrey, L. G. 1978. “Testing Against General Autoregressive and Moving Average 
Error Models When the Regressors Include Lagged Dependent Variables.” Econ-
ometrica 46:1293-1302.

Goldberg, S. 1958. Introduction to Difference Equations. New York: John Wiley & 
Sons. 

Goldberger, A. S. 1964. Econometric Theory. New York: John Wiley & Sons. 

1981. “Linear Regression After Selection.” Journal of Econometrics 15:357-
366.

1983. "Abnormal Selection Bias," in S. Karlin, T. Amemiya, and L. A. Good- 
man, eds., Studies in Econometrics, Time Series, and Multivariate Statistics, pp. 
67-84. New York: Academic Press.

Goldfeld, S. M., and R. E. Quandt. 1965. "Some Tests for Homoscedasticity.” Journal 
of the American Statistical Association 60:539-547.

1968. “Nonlinear Simultaneous Equations: Estimation and Prediction.” In-
ternational Economic Review 9:113-136.

1972. Nonlinear Methods in Econometrics. Amsterdam: North-Holland Pub-
lishing.

1978. "Asymptotic Tests for the Constancy of Regressions in the Heterosce- 
dastic Case." Research Memorandum no. 229, Econometric Research Program, 
Princeton University. 

Goldfeld, S. M., R. E. Quandt, and H. F. Trotter. 1966. “Maximization by Quadratic 
Hill-Climbing.” Econometrica 34:541-551.

Goodman, L. A. 1961. “Statistical Methods for the ‘Mover-Stayer’ Model.” Journal of
the American Statistical Association 56:841-868.

1972. “A Modified Multiple Regression Approach to the Analysis of Dichoto- 
mous Variables." American Sociological Review 37:28-46. 

Gourieroux, C., J. J. Laffont, and A. Monfort. 1980. “Coherency Conditions in Simul-
taneous Linear Equation Models with Endogenous Switching Regimes.” Econo-
metrica 48:675-695.

Gourieroux, C., and A. Monfort. 1981. “Asymptotic Properties of the Maximum 
Likelihood Estimator in Dichotomous Logit Models." Journal of Econometrics 
17:83-97. 

Gradshteyn, I. S., and I. M. Ryzhik. 1965. Table of Integrals, Series, and Products. 
New York: Academic Press. 

Granger, C. W. J., and P. Newbold. 1977. Forecasting Economic Time Series. New 
York: Academic Press. 

Greene, W. H. 1981. “On the Asymptotic Bias of the Ordinary Least Squares Estima-
tor of the Tobit Model.” Econometrica 49:505-513.

1983. “Estimation of Limited Dependent Variable Models by Ordinary Least 
Squares and the Method of Moments." Journal of Econometrics 21:195-212.

Grenander, U., and G. Szego. 1958. Toeplitz Forms and Their Applications. Berkeley: 
University of California Press. 




Griliches, Z. 1967. "Distributed Lags: A Survey.” Econometrica 35:16-49. 

Gronau, R. 1973. “The Effects of Children on the Housewife’s Value of Time.” 
Journal of Political Economy 81:S168-S199.

1974. “Wage Comparisons —a Selectivity Bias." Journal of Political Economy 
82:1119-1143.

Guilkey, D. K., and P. Schmidt. 1979. “Some Small Sample Properties of Estimators
and Test Statistics in the Multivariate Logit Model.” Journal of Econometrics
10:33-42.

Gumbel, E. J. 1961. “Bivariate Logistic Distributions.” Journal of the American Sta-
tistical Association 56:335-349.

Gunst, R. F., and R. L. Mason. 1977. “Biased Estimation in Regression: An Evalua- 
tion Using Mean Squared Error.” Journal of the American Statistical Association 
72:616-628. 

Gurland, J., I. Lee, and P. A. Dahm. 1960. “Polychotomous Quantal Response in 
Biological Assay.” Biometrics 16:382-398.

Haberman, S. J. 1978. Analysis of Qualitative Data. Vol. 1, Introductory Topics. New 
York: Academic Press. 

1979. Analysis of Qualitative Data. Vol. 2, New Developments. New York: 
Academic Press. 

Haessel, W. 1976. “Demand for Agricultural Commodities in Ghana: An Application 
of Nonlinear Two-Stage Least Squares with Prior Information.” American Jour-
nal of Agricultural Economics 58:341-345.

Halmos, P. R. 1950. Measure Theory. Princeton: D. Van Nostrand.

Hammerstrom, T. 1981. “Asymptotically Optimal Tests for Heteroscedasticity in the 
General Linear Model." Annals of Statistics 9:368-380.

Han, A. K. 1983. "Asymptotic Efficiency of the Partial Likelihood Estimator in the 
Proportional Hazard Model." Technical Report no. 412, Institute for Mathemati- 
cal Studies in the Social Sciences, Stanford University, Calif. 

Hartley, H. O. 1958. “Maximum Likelihood Estimation from Incomplete Data.” 
Biometrics 14:174-194.

1961. “The Modified Gauss-Newton Method for the Fitting of Non-Linear 
Regression Functions by Least Squares.” Technometrics 3:269-280.

Hartley, M. J. 1976a. "The Estimation of Markets in Disequilibrium: The Fixed 
Supply Case." International Economic Review 17:687-699.

1976b. “Estimation of the Tobit Model by Nonlinear Least Squares Methods.”
Discussion Paper no. 373, State University of New York, Buffalo.

1976c. “The Tobit and Probit Models: Maximum Likelihood Estimation by 
Ordinary Least Squares." Discussion Paper no. 374, State University of New 
York, Buffalo. 

Harvey, A. C. 1976. “Estimating Regression Models with Multiplicative Heterosce-
dasticity.” Econometrica 44:461-465.

1978. “The Estimation of Time-Varying Parameters from Panel Data.” An-
nales de l'INSEE 30-31:203-226.




1981a. The Econometric Analysis of Time Series. Oxford: Philip Allan Pub-
lishers.

1981b. Time Series Models. Oxford: Philip Allan Publishers. 

Hatanaka, M. 1974. “An Efficient Two-Step Estimator for the Dynamic Adjustment 
Model with Autoregressive Errors.” Journal of Econometrics 2:199-220.

1978. "On the Efficient Estimation Methods for the Macro-Economic Models 
Nonlinear in Variables." Journal of Econometrics 8:323-356.

Hause, J. C. 1980. “The Fine Structure of Earnings and On-the-Job Training Hypoth-
esis.” Econometrica 48:1013-1029.

Hausman, J. A. 1975. “An Instrumental Variable Approach to Full Information
Estimators for Linear and Certain Nonlinear Econometric Models.” Economet-
rica 43:727-738.

1978. “Specification Tests in Econometrics.” Econometrica 46:1251-1272.

Hausman, J. A., and D. McFadden. 1984. "Specification Tests for the Multinomial 
Logit Model.” Econometrica 52:1219-1240.

Hausman, J. A., and W. E. Taylor. 1981. “Panel Data and Unobservable Individual 
Effects." Econometrica 49:1377-1398.

Hausman, J. A., and D. A. Wise. 1976. “The Evaluation of Results from Truncated 
Samples: The New Jersey Income Maintenance Experiment.” Annals of Eco-
nomic and Social Measurement 5:421-445.

1977. “Social Experimentation, Truncated Distributions, and Efficient Esti-
mation.” Econometrica 45:919-938.

1978. “A Conditional Probit Model for Qualitative Choice: Discrete Decisions
Recognizing Interdependence and Heterogeneous Preferences.” Econometrica
46:403-426.

1979. “Attrition Bias in Experimental and Panel Data: The Gary Income 
Maintenance Experiment." Econometrica 47:455-473.

Heckman, J. J. 1974. “Shadow Prices, Market Wages, and Labor Supply.” Economet-
rica 42:679-693.

1976a. “The Common Structure of Statistical Models of Truncation, Sample
Selection and Limited Dependent Variables and a Simple Estimator for Such
Models.” Annals of Economic and Social Measurement 5:475-492.

1976b. “Simultaneous Equations Models with Continuous and Discrete En-
dogenous Variables and Structural Shifts,” in S. M. Goldfeld and R. E. Quandt,
eds., Studies in Nonlinear Estimation, pp. 235-272. Cambridge, Mass.: Ballinger
Publishing.

1978. “Dummy Endogenous Variables in a Simultaneous Equation System.”
Econometrica 46:931-960.

1979. “Sample Selection Bias as a Specification Error.” Econometrica
47:153-161.

1981a. “Statistical Models for Discrete Panel Data,” in C. F. Manski and
D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Appli-
cations, pp. 114-178. Cambridge, Mass.: MIT Press.




1981b. “The Incidental Parameters Problem and the Problem of Initial Condi-
tions in Estimating a Discrete Time-Discrete Data Stochastic Process,” in
C. F. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with
Econometric Applications, pp. 179-195. Cambridge, Mass.: MIT Press.

1981c. “Heterogeneity and State Dependence," in S. Rosen, ed., Studies in 
Labor Markets, pp. 91—139. Cambridge, Mass.: National Bureau of Economic 
Research.

Heckman, J. J., and G. J. Borjas. 1980. “Does Unemployment Cause Future Unem- 
ployment? Definitions, Questions and Answers from a Continuous Time Model 
of Heterogeneity and State Dependence.” Economica 47:247-283.

Heckman, J. J., and T. E. MaCurdy. 1980. “A Life Cycle Model of Female Labor 
Supply." Review of Economic Studies 47:47-74.

Heckman, J. J., and S. Polachek. 1974. “Empirical Evidence on the Functional Form 
of the Earnings-Schooling Relationship." Journal of the American Statistical As- 
sociation 69:350—354. 

Heckman, J. J., and B. Singer. 1982. “The Identification Problem in Econometric 
Models for Duration Data," in W. Hildenbrand, ed., Advances in Econometrics, 
pp. 39-77. Cambridge: Cambridge University Press. 

1984a. “A Method for Minimizing the Impact of Distributional Assumptions
in Econometric Models for Duration Data.” Econometrica 52:271-320.

1984b. “Econometric Duration Analysis.” Journal of Econometrics 24:63- 
132. 

Heckman, J. J., and R. J. Willis. 1977. “A Beta-Logistic Model for the Analysis of 
Sequential Labor Force Participation by Married Women." Journal of Political 
Economy 85:27-58. 

Hettmansperger, T. P., and J. W. McKean. 1977. “A Robust Alternative Based on 
Ranks to Least Squares in Analyzing Linear Models." Technometrics 19:275- 
284. 

Hildreth, C., and J. P. Houck. 1968. "Some Estimators for a Linear Model with 
Random Coefficients." Journal of the American Statistical Association 63:584-
595.

Hill, R. W., and P. W. Holland. 1977. “Two Robust Alternatives to Least-Squares 
Regression." Journal of the American Statistical Association 72:1041-1067.

Hinkley, D. V. 1971. "Inference in Two-Phase Regression." Journal of the American 
Statistical Association 66:736-743.

1975. “On Power Transformations to Symmetry.” Biometrika 62:101-112. 

Hoadley, B. 1971. “Asymptotic Properties of Maximum Likelihood Estimators for the 
Independent Not Identically Distributed Case.” Annals of Mathematical Statis- 
tics 42:1977-1991. 

Hodges, J. L., and E. L. Lehmann. 1950. “Some Problems in Minimax Point Estima- 
tion." Annals of Mathematical Statistics 21:182-197. 

1963. “Estimates of Location Based on Rank Tests.” Annals of Mathematical
Statistics 34:598-611.




Hoel, P. G. 1971. Introduction to Mathematical Statistics, 4th ed. New York: John 
Wiley & Sons. 

Hoerl, A. E., and R. W. Kennard. 1970a. “Ridge Regression: Biased Estimation for 
Nonorthogonal Problems.” Technometrics 12:55-67.

1970b. “Ridge Regression: Applications to Nonorthogonal Problems." Tech- 
nometrics 12:69-82. 

Hoerl, A. E., R. W. Kennard, and K. F. Baldwin. 1975. "Ridge Regression: Some 
Simulations." Communications in Statistics 4:105-123.

Hogg, R. V. 1974. "Adaptive Robust Procedures: A Partial Review and Some Sugges- 
tions for Future Applications and Theory." Journal of the American Statistical 
Association 69:909-923.

Holford, T. R. 1980. “The Analysis of Rates and of Survivorship Using Log-Linear
Models.” Biometrics 36:299-305.

Horowitz, J. L., J. M. Sparmann, and C. F. Daganzo. 1982. “An Investigation of the 
Accuracy of the Clark Approximation for the Multinomial Probit Model.” Trans-
portation Science 16:382-401.

Hosek, J. R. 1980. "Determinants of Family Participation in the AFDC-Unemployed 
Fathers Program." Review of Economics and Statistics 62:466-470.

Howe, H., R. A. Pollack, and T. J. Wales. 1979. “Theory and Time Series Esti- 
mation of the Quadratic Expenditure System." Econometrica 47:1231- 
1248. 

Hsiao, C. 1974. “Statistical Inference for a Model with Both Random Cross-Sectional 
and Time Effects.” International Economic Review 15:12-30. 

1975. “Some Estimation Methods for a Random Coefficient Model.” Econo-
metrica 43:305-325.

1983. “Identification,” in Z. Griliches and M. D. Intriligator, eds., Handbook
of Econometrics 1:223-283. Amsterdam: North-Holland Publishing.

Hsieh, D., C. F. Manski, and D. McFadden. 1983. “Estimation of Response Probabili- 
ties from Augmented Retrospective Observations." Mimeographed Paper, De- 
partment of Economics, Massachusetts Institute of Technology, Cambridge, 
Mass. 

Huang, D. S. 1964. “Discrete Stock Adjustment: The Case of Demand for Automo-
biles.” International Economic Review 5:46-62.

Huber, P. J. 1964. “Robust Estimation of a Location Parameter.” Annals of Mathe-
matical Statistics 35:73-101.

1965. “The Behavior of Maximum Likelihood Estimates under Nonstandard
Conditions,” in J. Neyman, ed., Proceedings of the Fifth Berkeley Symposium,
1:221-233. Berkeley: University of California Press.

1972. “Robust Statistics: A Review.” Annals of Mathematical Statistics
43:1041-1067.

1977. Robust Statistical Procedures. Philadelphia: Society for Industrial and
Applied Mathematics.

1981. Robust Statistics. New York: John Wiley & Sons. 




Hurd, M. 1979. “Estimation in Truncated Samples When There Is Heteroscedas- 
ticity.” Journal of Econometrics 11:247-258. 

Imhof, J. P. 1961. "Computing the Distribution of Quadratic Forms in Normal Vari- 
ables.” Biometrika 48:419-426. 

Jaeckel, L. A. 1972. “Estimating Regression Coefficients by Minimizing the Disper-
sion of the Residuals.” Annals of Mathematical Statistics 43:1449-1458.

James, W., and C. Stein. 1961. “Estimation with Quadratic Loss," in J. Neyman, ed., 
Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and 
Probability, 1:361-379. Berkeley: University of California Press.

Jennrich, R. I. 1969. “Asymptotic Properties of Non-Linear Least Squares Estima-
tors.” The Annals of Mathematical Statistics 40:633-643.

Jobson, J. D., and W. A. Fuller. 1980. “Least Squares Estimation When the Covar- 
iance Matrix and Parameter Vector Are Functionally Related." Journal of the 
American Statistical Association 75:176-181.

Johnson, N. L., and S. Kotz. 1970a. Continuous Univariate Distributions — 1. Boston: 
Houghton Mifflin. 

1970b. Continuous Univariate Distributions — 2. Boston: Houghton Mifflin. 

1972. Distributions in Statistics: Continuous Multivariate Distributions. New 
York: John Wiley & Sons. 

Johnston, J. 1972. Econometric Methods. 2d ed. New York: McGraw-Hill. 

Joreskog, K. G., and D. Sorbom. 1976. LISREL III Estimation of Structural Equation 
Systems by Maximum Likelihood Methods. Chicago: National Educational Re-
sources.

Jorgenson, D. W., and J. Laffont. 1974. “Efficient Estimation of Nonlinear Simulta- 
neous Equations with Additive Disturbances." Annals of Economic and Social 
Measurement 3:615-640.

Jorgenson, D. W., and L. J. Lau. 1975. “The Structure of Consumer Preferences.” 
Annals of Economic and Social Measurement 4:49-101.

1978. "Testing the Integrability of Consumer Demand Functions, United 
States, 1947-1971." Mimeographed paper.

Judge, G. G., and M. E. Bock. 1983. “Biased Estimation,” in Z. Griliches and M. D.
Intriligator, eds., Handbook of Econometrics, 1:599-649. Amsterdam: North-
Holland Publishing.

Kahn, L. M., and K. Morimune. 1979. “Unions and Employment Stability: A Se-
quential Logit Approach.” International Economic Review 20:217-236.

Kakwani, N. C. 1967. “The Unbiasedness of Zellner's Seemingly Unrelated Regres- 
sion Equations Estimators." Journal of the American Statistical Association 
62:141-142. 

Kalbfleisch, J. D. 1974. “Some Efficiency Calculations for Survival Distributions." 
Biometrika 61:31-38.

Kalbfleisch, J. D., and R. L. Prentice. 1973. "Marginal Likelihoods Based On Cox's 
Regression and Life Models." Biometrika 60:267-278.




1980. The Statistical Analysis of Failure Time Data. New York: John Wiley & 
Sons. 

Kariya, T. 1981. “Bounds for the Covariance Matrices of Zellner’s Estimator in the 
SUR Model and the 2SAE in a Heteroscedastic Model.” Journal of the American 
Statistical Association 76:975-979.

Kay, R. 1979. "Some Further Asymptotic Efficiency Calculations for Survival Data 
Regression Models.” Biometrika 66:91-96.

Keeley, M. C., P. K. Robins, R. G. Spiegelman, and R. W. West. 1978. “The Estima- 
tion of Labor Supply Models Using Experimental Data.” American Economic 
Review 68:873-887. 

Kelejian, H. H. 1971. “Two-Stage Least Squares and Econometric Systems Linear in 
Parameters but Nonlinear in the Endogenous Variables." Journal of the American 
Statistical Association 66:373-374.

1974. “Efficient Instrumental Variable Estimation of Large Scale Nonlinear 
Econometric Models." Mimeographed paper. 

Kelejian, H. H., and S. W. Stephan. 1983. “Inference in Random Coefficient Panel 
Data Models: A Correction and Clarification of the Literature." International 
Economic Review 24:249-254.

Kendall, M. G., and A. Stuart. 1979. The Advanced Theory of Statistics, 4th ed., vol. 2. 
New York: Charles Griffin and Co. 

Kenny, L. W., L. F. Lee, G. S. Maddala, and R. P. Trost. 1979. “Returns to College 
Education: An Investigation of Self-Selection Bias Based on the Project Talent 
Data.” International Economic Review 20:775-789.

Kiefer, J., and J. Wolfowitz. 1956. “Consistency of the Maximum Likelihood Estima-
tor in the Presence of Infinitely Many Incidental Parameters.” Annals of Mathe-
matical Statistics 27:887-906.

Koenker, R. 1981a. "Robust Methods in Econometrics." Bell Laboratories Eco- 
nomics Discussion Paper no. 228. 

1981b. “A Note on Studentizing a Test for Heteroscedasticity.” Journal of
Econometrics 17:107-112.

Koenker, R., and G. Bassett, Jr. 1978. "Regression Quantiles.” Econometrica 
46:33-50.

1982. "Robust Tests for Heteroscedasticity Based on Regression Quantiles." 
Econometrica 50:43-61. 

Koopmans, T. C., and W. C. Hood. 1953. “The Estimation of Simultaneous 
Linear Economic Relationships," in W. C. Hood and T. C. Koopmans, eds., 
Studies in Econometric Method, pp. 112-199. New York: John Wiley 
& Sons. 

Kotlikoff, L. J. 1979. “Testing the Theory of Social Security and Life Cycle Accumula-
tion.” American Economic Review 69:396-410.

Koyck, L. M. 1954. Distributed Lags and Investment Analysis. Amsterdam: North- 
Holland Publishing. 




Lachenbruch, P. A., C. Sneeringer, and L. T. Revo. 1973. “Robustness of the Linear 
and Quadratic Discriminant Function to Certain Types of Nonnormality.” Com- 
munications in Statistics 1:39-56. 

Lai, T. L., H. Robbins, and C. Z. Wei. 1978. “Strong Consistency of Least Squares 
Estimates in Multiple Regression.” Proceedings of the National Academy of 
Sciences 75:3034-3036.

Lancaster, T. 1979. “Econometric Methods for the Duration of Unemployment.” 
Econometrica 47:939-956.

LeCam, L. 1953. “On Some Asymptotic Properties of Maximum Likelihood Esti- 
mates and Related Bayes Estimates.” University of California Publications in 
Statistics 1:277-330.

Lee, L. F. 1977. "Estimation of a Modal Choice Model for the Work Journey with 
Incomplete Observations." Mimeographed paper, Department of Economics, 
University of Minnesota. 

1978. “Unionism and Wage Rates: A Simultaneous Equations Model with
Qualitative and Limited Dependent Variables.” International Economic Review
19:415-433.

1981. “Simultaneous Equations Models with Discrete and Censored Vari-
ables,” in C. F. Manski and D. McFadden, eds., Structural Analysis of Discrete
Data with Econometric Applications, pp. 346-364. Cambridge, Mass.: MIT
Press.

1982a. “Health and Wage: A Simultaneous Equation Model with Multiple
Discrete Indicators.” International Economic Review 23:199-221.

1982b. “A Bivariate Logit Model.” Mimeographed paper, Center for Econo-
metrics and Decision Sciences, University of Florida.

1982c. "Some Approaches to the Correction of Selectivity Bias." Review of 
Economic Studies 49:355-372.

Lee, L. F., G. S. Maddala, and R. P. Trost. 1980. “Asymptotic Covariance Matrices of 
Two-Stage Probit and Two-Stage Tobit Methods for Simultaneous Equations 
Models with Selectivity.” Econometrica 48:491-503.

Lerman, S. R., and C. F. Manski. 1981. “On the Use of Simulated Frequencies to 
Approximate Choice Probabilities,” in C. F. Manski and D. McFadden, eds., 
Structural Analysis of Discrete Data with Econometric Applications, pp. 305- 
319. Cambridge, Mass.: MIT Press. 

Li, M. M. 1977. “A Logit Model of Homeownership.” Econometrica 45:1081-1098.

Lillard, L. A., and Y. Weiss. 1979. “Components of Variation in Panel Earnings Data: 
American Scientists 1960-70.” Econometrica 47:437-454.

Lillard, L. A., and R. Willis. 1978. “Dynamic Aspects of Earnings Mobility.” Econo-
metrica 46:985-1012.

Lindley, D. V. 1972. Bayesian Statistics, A Review. Philadelphia: Society for Industrial
and Applied Mathematics. 

Loève, M. 1977. Probability Theory, 4th ed. Princeton: D. Van Nostrand.




McCall, J. J. 1971. “A Markovian Model of Income Dynamics.” Journal of the Ameri-
can Statistical Association 66:439-447.

McFadden, D. 1974. “Conditional Logit Analysis of Qualitative Choice Behavior,” in 
P. Zarembka, ed., Frontiers in Econometrics, pp. 105—142. New York: Academic 
Press. 

1976a. “A Comment on Discriminant Analysis ‘versus’ Logit Analysis.”
Annals of Economic and Social Measurement 5:511-524.

1976b. “The Revealed Preferences of a Government Bureaucracy: Empirical
Evidence.” Bell Journal of Economics 1:55-72.

1977. “Quantitative Methods for Analyzing Travel Behavior of Individuals:
Some Recent Developments.” Cowles Foundation Discussion Paper no. 474.

1978. “Modelling the Choice of Residential Location,” in A. Karlqvist et al.,
eds., Spatial Interaction Theory and Planning Models, pp. 75-96. Amsterdam:
North-Holland Publishing.

1981. “Econometric Models of Probabilistic Choice,” in C. F. Manski and D. 
McFadden, eds., Structural Analysis of Discrete Data with Econometric Applica- 
tions, pp. 198-272. Cambridge, Mass.: MIT Press. 

McFadden, D., and F. Reid. 1975. “Aggregate Travel Demand Forecasting from 
Disaggregated Behavior Models.” Transportation Research Board, Record, no. 
534, Washington, D. C. 

McGillivray, R. G. 1972. “Binary Choice of Urban Transport Mode in the San Fran-
cisco Bay Region.” Econometrica 40:827-848.

McKean, J. W., and T. P. Hettmansperger. 1976. “Tests of Hypothesis in the General 
Linear Model Based on Ranks." Communications in Statistics A,1:693-709. 

MacRae, E. 1977. "Estimation of Time-Varying Markov Process with Aggregate 
Data." Econometrica 45:183-198.

MaCurdy, T. E. 1980. “An Intertemporal Analysis of Taxation and Work Disincen- 
tives.” Working Papers in Economics no. E-80-4, The Hoover Institution, Stan- 
ford University, Calif. 

1982. “The Use of Time Series Processes to Model the Error Structure 
of Earnings in a Longitudinal Data Analysis.” Journal of Econometrics 18:83-
114.

Maddala, G. S. 1971. “The Use of Variance Components Models in Pooling Cross 
Section and Time Series Data.” Econometrica 39:341-358.

1980. “Disequilibrium, Self-Selection and Switching Models.” Social Science
Working Paper 303, California Institute of Technology.

1983. Limited-Dependent and Qualitative Variables in Econometrics. Cam- 
bridge: Cambridge University Press. 

Maddala, G. S., and F. D. Nelson. 1974. “Maximum Likelihood Methods for Models
of Markets in Disequilibrium.” Econometrica 42:1013-1030.

Malik, H. J., and B. Abraham. 1973. “Multivariate Logistic Distributions." Annals of 
Statistics 1:588-590.




Malinvaud, E. 1961. “Estimation et prévision dans les modèles économiques autoré-
gressifs.” Revue de l'institut international de statistique 29:1-32.

1980. Statistical Methods of Econometrics, 3d rev. ed. Amsterdam: North- 
Holland Publishing. 

Mallows, C. L. 1964. “Choosing Variables in a Linear Regression: A Graphical Aid.” 
Paper presented at the Central Region Meeting of the Institute of Mathematical 
Statistics, Manhattan, Kansas. 

Mandelbrot, B. 1963. “New Methods in Statistical Economics." Journal of Political 
Economy 71:421-440.

Mann, H. B., and A. Wald. 1943. “On Stochastic Limit and Order Relationships.” 
Annals of Mathematical Statistics 14:217-226.

Manski, C. F. 1975. “The Maximum Score Estimation of the Stochastic Utility Model 
of Choice." Journal of Econometrics 3:205-228.

1985. “Semiparametric Analysis of Discrete Response: Asymptotic Properties 
of the Maximum Score Estimator." Journal of Econometrics 27:313-333.

Manski, C. F., and S. R. Lerman. 1977. “The Estimation of Choice Probabilities from 
Choice-Based Samples.” Econometrica 45:1977-1988.

Manski, C. F., and D. McFadden. 1981. “Alternative Estimators and Sample Designs 
for Discrete Choice Analysis," in C. F. Manski and D. McFadden, eds., Structural 
Analysis of Discrete Data with Econometric Applications, pp. 2-50. Cambridge, 
Mass.: MIT Press. 

Marcus, M., and H. Minc. 1964. A Survey of Matrix Theory and Matrix Inequalities.
Boston: Prindle, Weber & Schmidt. 

Mariano, R. S. 1982. "Analytical Small-Sample Distribution Theory in Econometrics: 
The Simultaneous-Equations Case." International Economic Review 23:503- 
533. 

Mariano, R. S., and B. W. Brown. 1983. “Asymptotic Behavior of Predictors in a 
Nonlinear Simultaneous System.” International Economic Review 24:523-536.

Marquardt, D. W. 1963. “An Algorithm for the Estimation of Non-Linear Parame- 
ters.” Society for Industrial and Applied Mathematics Journal 11:431—441. 

Mayer, T. 1975. "Selecting Economic Hypotheses by Goodness of Fit." Economic 
Journal 85:877-883.

Miller, R. G., Jr. 1981. Survival Analysis. New York: John Wiley & Sons. 

Mizon, G. E. 1977. “Inferential Procedures in Nonlinear Models: An Application in a 
UK Industrial Cross Section Study of Factor Substitution and Returns to Scale." 
Econometrica 45:1221-1242.

Mood, A. M., F. A. Graybill, and D. C. Boes. 1974. Introduction to the Theory of 
Statistics, 3d ed. New York: McGraw-Hill. 

Morimune, K. 1979. “Comparisons of Normal and Logistic Models in the Bivariate 
Dichotomous Analysis.” Econometrica 47:957-976.

Mosteller, F. 1946. “On Some Useful ‘Inefficient’ Statistics." Annals of Mathematical 
Statistics 17:377-408.




Muthén, B. 1979. “A Structural Probit Model with Latent Variables." Journal of the 
American Statistical Association 74:807-811.

Nagar, A. L., and S. N. Sahay. 1978. “The Bias and Mean Squared Error of Forecasts 
from Partially Restricted Reduced Form." Journal of Econometrics 7:227-243.

Nakamura, M., A. Nakamura, and D. Cullen. 1979. “Job Opportunities, the Offered 
Wage, and the Labor Supply of Married Women." American Economic Review 
69:787-805.

Nelder, J. A., and R. W. M. Wedderburn. 1972. "Generalized Linear Models." Jour- 
nal of the Royal Statistical Society ser. B, 135:370—384. 

Nelson, F. D. 1977. “Censored Regression Models with Unobserved, Stochastic Cen-
soring Thresholds.” Journal of Econometrics 6:309-327.

1981. “A Test for Misspecification in the Censored Normal Model.” Econo- 
metrica 49:1317-1329. 

Nelson, F. D., and L. Olson. 1978. “Specification and Estimation of a Simultaneous- 
Equation Model with Limited Dependent Variables.” International Economic 
Review 19:695-709.

Nerlove, M. 1958. Distributed Lags and Demand Analysis for Agricultural and Other 
Commodities. Washington, D.C.: U.S. Department of Agriculture. 

1971. “Further Evidence on the Estimation of Dynamic Relations from a 
Time Series of Cross Sections.” Econometrica 39:359-382.

Nerlove, M., D. M. Grether, and J. L. Carvalho. 1979. Analysis of Economic Time 
Series: A Synthesis. New York: Academic Press. 

Nerlove, M., and S. J. Press. 1973. “Univariate and Multivariate Log-Linear and 
Logistic Models.” RAND Corporation Paper R-1306-EDA/NIH, Santa Monica, 
Calif. 

Neyman, J., and E. L. Scott. 1948. “Consistent Estimates Based on Partially Consis- 
tent Observations.” Econometrica 16:1-32. 

Nickell, S. 1979. “Estimating the Probability of Leaving Unemployment.” Economet-
rica 47:1249-1266.

Norden, R. H. 1972. “A Survey of Maximum Likelihood Estimation.” International
Statistical Review 40:329-354.

1973. “A Survey of Maximum Likelihood Estimation, Part 2.” International
Statistical Review 41:39-58.

Oberhofer, W., and J. Kmenta. 1974. “A General Procedure for Obtaining Maximum
Likelihood Estimates in Generalized Regression Models.” Econometrica
42:579-590.

Olsen, R. J. 1978. “Note on the Uniqueness of the Maximum Likelihood Estimator for 
the Tobit Model." Econometrica 46:1211-1215. 

1980. “A Least Squares Correction for Selectivity Bias.” Econometrica 
48:1815-1820.

Paarsch, H. J. 1984. “A Monte Carlo Comparison of Estimators for Censored Regres- 
sion Models." Journal of Econometrics 24:197-213. 




Parke, W. R. 1982. "An Algorithm for FIML and 3SLS Estimation of Large Nonlinear 
Models." Econometrica 50:81—95. 

Pearson, E. S., and N. W. Please. 1975. "Relation between the Shape of Population 
Distribution and the Robustness of Four Simple Statistical Tests." Biometrika 
62:223-241. 

Pesaran, M. H. 1982. "Comparison of Local Power of Alternative Tests of Non-Nested 
Regression Models." Econometrica 50:1287 —1305. 

Pfanzagl, J. 1973. “Asymptotic Expansions Related to Minimum Contrast Estima- 
tors." Annals of Statistics 1:993— 1026. 

Phillips, P. C. B. 1977. “Approximations to Some Finite Sample Distributions Asso- 
ciated with a First-Order Stochastic Difference Equation." Econometrica 
45:463—485. 

1982. “On the Consistency of Nonlinear FIML.” Econometrica 50:1307-
1324.

1983. “Exact Small Sample Theory in the Simultaneous Equations Model,” in
Z. Griliches and M. D. Intriligator, eds., Handbook of Econometrics, 2:449-516.
Amsterdam: North-Holland Publishing.

Pierce, D. A. 1971. “Least Squares Estimation in the Regression Model with Autore-
gressive-Moving Average Errors.” Biometrika 58:299-312.

Plackett, R. L. 1960. Principles of Regression Analysis. London: Oxford University 
Press. 

Plackett, R. L. 1965. “A Class of Bivariate Distributions.” Journal of the American
Statistical Association 60:516-522.

Poirier, D. J. 1978. “The Use of the Box-Cox Transformation in Limited Dependent
Variable Models.” Journal of the American Statistical Association 73:284-287.

Powell, J. L. 1981. “Least Absolute Deviations Estimation for Censored and Trun-
cated Regression Models.” Technical Report no. 356, Institute for Mathematical
Studies in the Social Sciences, Stanford University, Calif.

Powell, J. L. 1983. “Asymptotic Normality of the Censored and Truncated Least Absolute
Deviations Estimators.” Technical Report no. 395, Institute for Mathematical
Studies in the Social Sciences, Stanford University, Calif.

Powell, M. J. D. 1964. “An Efficient Method for Finding the Minimum of a Function
of Several Variables without Calculating Derivatives.” Computer Journal
7:155-162.

Powers, J. A., L. C. Marsh, R. R. Huckfeldt, and C. L. Johnson. 1978. “A Comparison
of Logit, Probit and Discriminant Analysis in Predicting Family Size.” American
Statistical Association Proceedings of the Social Statistics Section, pp. 693-697.

Prais, S. J., and H. S. Houthakker. 1955. The Analysis of Family Budgets. Cambridge: 
Cambridge University Press. 

Pratt, J. W. 1981. “Concavity of the Log Likelihood.” Journal of the American Statis-
tical Association 76:103-106.




Press, S. J., and S. Wilson. 1978. “Choosing Between Logistic Regression and Discrim-
inant Analysis.” Journal of the American Statistical Association 73:699-705.

Quandt, R. E. 1958. “The Estimation of the Parameters of a Linear Regression System
Obeying Two Separate Regimes.” Journal of the American Statistical Association
53:873-880.

Quandt, R. E. 1982. “Econometric Disequilibrium Models.” Econometric Reviews 1:1-63.

Quandt, R. E. 1983. “Computational Problems and Methods,” in Z. Griliches and M. D.
Intriligator, eds., Handbook of Econometrics, 1:699-764. Amsterdam:
North-Holland Publishing.

Quandt, R. E., and J. B. Ramsey. 1978. “Estimating Mixtures of Normal Distributions
and Switching Regressions.” Journal of the American Statistical Association
73:730-738.

Radner, R., and L. S. Miller. 1970. “Demand and Supply in U.S. Higher Education: A
Progress Report.” American Economic Review, Papers and Proceedings
60:326-334.

Rao, C. R. 1947. “Large Sample Tests of Statistical Hypotheses Concerning Several
Parameters with Applications to Problems of Estimation.” Proceedings of the
Cambridge Philosophical Society 44:50-57.

Rao, C. R. 1965. “The Theory of Least Squares When the Parameters Are Stochastic and
Its Applications to the Analysis of Growth Curves.” Biometrika 52:447-458.

Rao, C. R. 1970. “Estimation of Heteroscedastic Variances in a Linear Model.” Journal
of the American Statistical Association 65:161-172.

Rao, C. R. 1973. Linear Statistical Inference and Its Applications, 2d ed. New York: John
Wiley & Sons.

Reece, W. S. 1979. “Charitable Contributions: The New Evidence on Household
Behavior.” American Economic Review 69:142-151.

Rice, P., and V. K. Smith. 1977. “An Econometric Model of the Petroleum Industry.”
Journal of Econometrics 6:263-288.

Roberts, R. B., G. S. Maddala, and G. Enholm. 1978. “Determinants of the Requested
Rate of Return and the Rate of Return Granted in a Formal Regulatory Process.”
Bell Journal of Economics 9:611-621.

Robinson, P. M. 1982a. “On the Asymptotic Properties of Estimators of Models
Containing Limited Dependent Variables.” Econometrica 50:27-41.

Robinson, P. M. 1982b. “Analysis of Time Series from Mixed Distributions.” Annals of Statis-
tics 10:915-925.

Rosenberg, B. 1973. “The Analysis of a Cross Section of Time Series by Stochastically
Convergent Parameter Regression.” Annals of Economic and Social Measure-
ment 2:399-428.

Rosenzweig, M. R. 1980. “Neoclassical Theory and the Optimizing Peasant: An Econ-
ometric Analysis of Market Family Labor Supply in a Developing Country.”
Quarterly Journal of Economics 94:31-55.




Royden, H. L. 1968. Real Analysis, 2d ed. New York: Macmillan. 

Ruppert, D., and R. J. Carroll. 1980. “Trimmed Least Squares Estimation in the
Linear Model.” Journal of the American Statistical Association 75:828-838.

Sant, D. T. 1978. “Partially Restricted Reduced Forms: Asymptotic Relative Effi- 
ciency.” International Economic Review 19:739-747. 

Sargent, T. J. 1978. “Estimation of Dynamic Labor Schedules under Rational Expec-
tations.” Journal of Political Economy 86:1009-1044.

Sawa, T., and T. Hiromatsu. 1973. “Minimax Regret Significance Points for a Prelimi- 
nary Test in Regression Analysis.” Econometrica 41:1093-1101. 

Scheffé, H. 1959. The Analysis of Variance. New York: John Wiley & Sons. 

Schlossmacher, E. J. 1973. “An Iterative Technique for Absolute Deviation Curve
Fitting.” Journal of the American Statistical Association 68:857-865.

Schmee, J., and G. J. Hahn. 1979. “A Simple Method for Regression Analysis with
Censored Data.” Technometrics 21:417-432.

Schmidt, P., and R. Sickles. 1977. “Some Further Evidence on the Use of the Chow
Test under Heteroscedasticity.” Econometrica 45:1293-1298.

Schwarz, G. 1978. “Estimating the Dimension of a Model.” Annals of Statistics
6:461-464.

Sclove, S. L. 1973. “Least Squares Problems with Random Regression Coefficients.” 
Technical Report no. 87, Institute for Mathematical Studies in the Social 
Sciences, Stanford University, Calif. 

Sclove, S. L., C. L. Morris, and R. Radhakrishnan. 1972. “Non-Optimality of Prelimi-
nary-Test Estimators for the Mean of a Multivariate Normal Distribution.”
Annals of Mathematical Statistics 43:1481-1490.

Shiller, R. J. 1978. “Rational Expectations and the Dynamic Structure of Macroeco-
nomic Models.” Journal of Monetary Economics 4:1-44.

Shorrocks, A. F. 1976. “Income Mobility and the Markov Assumption.” Economic
Journal 86:566-578.

Silberman, J. I., and G. C. Durden. 1976. “Determining Legislative Preferences on the
Minimum Wage: An Economic Approach.” Journal of Political Economy
84:317-329.

Silberman, J. I., and W. K. Talley. 1974. “N-Chotomous Dependent Variables: An
Application to Regulatory Decision-Making.” American Statistical Association
Proceedings of the Business and Economic Statistics Section, pp. 573-576.

Silvey, S. D. 1959. “The Lagrangian Multiplier Test.” Annals of Mathematical Statis-
tics 30:389-407.

Silvey, S. D. 1969. “Multicollinearity and Imprecise Estimation.” Journal of the Royal
Statistical Society ser. B, 31:539-552.

Small, K. A. 1981. “Ordered Logit: A Discrete Choice Model with Proximate Covar-
iance Among Alternatives.” Research Memorandum no. 292, Econometric Re-
search Program, Princeton University.




Small, K. A., and D. Brownstone. 1982. “Efficient Estimation of Nested Logit Models: 
An Application to Trip Timing.” Research Memorandum no. 296, Econometric 
Research Program, Princeton University. 

Smith, K. C., N. E. Savin, and J. L. Robertson. 1984. “A Monte Carlo Comparison of
Maximum Likelihood and Minimum Chi-Square Sampling Distributions in
Logit Analysis.” Biometrics 40:471-482.

Spitzer, J. J. 1976. “The Demand for Money, the Liquidity Trap, and Functional
Forms.” International Economic Review 17:220-227.

Spitzer, J. J. 1978. “A Monte Carlo Investigation of the Box-Cox Transformation in Small
Samples.” Journal of the American Statistical Association 73:488-495.

Srivastava, V. K., and T. D. Dwivedi. 1979. “Estimation of Seemingly Unrelated
Regression Equations: A Brief Survey.” Journal of Econometrics 10:15-32.

Stapleton, D. C., and D. J. Young. 1984. “Censored Normal Regression with Measure-
ment Error on the Dependent Variable.” Econometrica 52:737-760.

Stein, C. 1973. “Estimation of the Mean of a Multivariate Normal Distribution.”
Technical Report no. 48, Department of Statistics, Stanford University, Calif.

Stephenson, S. P., and J. F. McDonald. 1979. “Disaggregation of Income Mainte-
nance Impacts on Family Earnings.” Review of Economics and Statistics
61:354-360.

Stigler, S. M. 1973. “Simon Newcomb, Percy Daniell, and the History of Robust
Estimation, 1885-1920.” Journal of the American Statistical Association
68:872-879.

Stigler, S. M. 1977. “Do Robust Estimators Work with Real Data?” Annals of Statistics
5:1055-1098.

Strawderman, W. E. 1978. “Minimax Adaptive Generalized Ridge Regression Esti-
mators.” Journal of the American Statistical Association 73:623-627.

Strickland, A. D., and L. W. Weiss. 1976. “Advertising, Concentration, and Price-Cost
Margins.” Journal of Political Economy 84:1109-1121.

Strotz, R. H. 1960. “Interdependence as a Specification Error.” Econometrica
28:428-442.

Swamy, P. A. V. B. 1970. “Efficient Inference in a Random Coefficient Regression
Model.” Econometrica 38:311-323.

Swamy, P. A. V. B. 1980. “A Comparison of Estimators for Undersized Samples.” Journal of
Econometrics 14:161-181.

Swamy, P. A. V. B., and J. S. Mehta. 1977. “Estimation of Linear Models with Time
and Cross-Sectionally Varying Parameters.” Journal of the American Statistical
Association 72:890-891.

Taylor, W. E. 1978. “The Heteroscedastic Linear Model: Exact Finite Sample Re-
sults.” Econometrica 46:663-676.

Taylor, W. E. 1980. “Small Sample Considerations in Estimation from Panel Data.” Journal
of Econometrics 13:203-223.




Taylor, W. E. 1981. “On the Efficiency of the Cochrane-Orcutt Estimator.” Journal of Econ-
ometrics 17:67-82.

Taylor, W. E. 1983. “On the Relevance of Finite Sample Distribution Theory.” Econometric
Reviews 2:1-39.

Taylor, W. F. 1953. “Distance Functions and Regular Best Asymptotically Normal 
Estimates." Annals of Mathematical Statistics 24:85-92. 

Telser, L. G. 1963. “Least-Squares Estimates of Transition Probabilities,” in C. Christ, 
ed., Measurement in Economics, pp. 270-292. Stanford, Calif.: Stanford Univer- 
sity Press. 

Theil, H. 1953. “Repeated Least-Squares Applied to Complete Equation Systems.” 
Mimeographed paper. The Hague: Central Planning Bureau. 

Theil, H. 1961. Economic Forecasts and Policy, 2d ed. Amsterdam: North-Holland
Publishing.

Theil, H. 1971. Principles of Econometrics. New York: John Wiley & Sons.

Theil, H., and A. S. Goldberger. 1961. “On Pure and Mixed Statistical Estimation in 
Economics.” International Economic Review 2:65-78. 

Thisted, R. A. 1976. “Ridge Regression, Minimax Estimation, and Empirical Bayes 
Method.” Technical Report no. 28, Division of Biostatistics, Stanford University, 
Calif. 

Tobin, J. 1958. “Estimation of Relationships for Limited Dependent Variables.”
Econometrica 26:24-36.

Toikka, R. S. 1976. “A Markovian Model of Labor Market Decisions by Workers.”
American Economic Review 66:821-834.

Tomes, N. 1981. “The Family, Inheritance, and the Intergenerational Transmission of
Inequality.” Journal of Political Economy 89:928-958.

Toyoda, T. 1974. “Use of the Chow Test Under Heteroscedasticity.” Econometrica
42:601-608.

Tsiatis, A. A. 1981. “A Large Sample Study of Cox's Regression Model.” The Annals of
Statistics 9:93-108.

Tsurumi, H. 1970. “Nonlinear Two-Stage Least Squares Estimation of CES Produc-
tion Functions Applied to the Canadian Manufacturing Industries.” Review of
Economics and Statistics 52:200-207.

Tuma, N. B. 1976. “Rewards, Resources, and the Rate of Mobility: A Nonstation- 
ary Multivariate Stochastic Model.” American Sociological Review 41:338- 
360. 

Tuma, N. B., M. T. Hannan, and L. P. Groeneveld. 1979. “Dynamic Analysis of Event
Histories.” American Journal of Sociology 84:820-854.

Uhler, R. S. 1968. “The Demand for Housing: An Inverse Probability Approach.”
Review of Economics and Statistics 50:129-134.

Vinod, H. D. 1978. “A Ridge Estimator Whose MSE Dominates OLS.” International 
Economic Review 19:727-737. 

Wald, A. 1943. “Tests of Statistical Hypotheses Concerning Several Parameters When
the Number of Observations Is Large.” Transactions of the American Mathemati-
cal Society 54:426-482.

Wald, A. 1949. “Note on the Consistency of the Maximum Likelihood Estimate.”
Annals of Mathematical Statistics 20:595-601.

Wales, T. J., and A. D. Woodland. 1980. “Sample Selectivity and the Estimation of
Labor Supply Functions.” International Economic Review 21:437-468.

Walker, S. H., and D. B. Duncan. 1967. “Estimation of the Probability of an Event as a
Function of Several Independent Variables.” Biometrika 54:167-179.

Wallace, T. D., and A. Hussain. 1969. “The Use of Error Components Models in
Combining Cross-Section with Time Series Data.” Econometrica 37:55-72.

Wallis, K. F. 1980. “Econometric Implications of the Rational Expectations Hypoth-
esis.” Econometrica 48:49-73.

Warner, S. L. 1962. Stochastic Choice of Mode in Urban Travel: A Study in Binary
Choice. Evanston, Ill.: Northwestern University Press.

Warner, S. L. 1963. “Multivariate Regression of Dummy Variates Under Normality As-
sumptions.” Journal of the American Statistical Association 58:1054-1063.

Watson, G. S. 1955. “Serial Correlation in Regression Analysis, I.” Biometrika
42:327-341.

Welch, B. L. 1938. “The Significance of the Difference between Two Means When the
Population Variances Are Unequal.” Biometrika 29:350-362.

Westin, R. B. 1974. “Predictions from Binary Choice Models.” Journal of Economet-
rics 2:1-16.

Westin, R. B., and P. W. Gillen. 1978. “Parking Location and Transit Demand.”
Journal of Econometrics 8:75-101.

White, H. 1980a. “A Heteroskedasticity-Consistent Covariance Matrix Estimator and
a Direct Test for Heteroskedasticity.” Econometrica 48:817-838.

White, H. 1980b. “Nonlinear Regression on Cross-Section Data.” Econometrica
48:721-746.

White, H. 1982. “Instrumental Variables Regression with Independent Observations.”
Econometrica 50:483-499.

White, H., ed. 1983. “Non-Nested Models.” Journal of Econometrics 21:1-160.

White, H., and G. M. MacDonald. 1980. “Some Large Sample Tests for Nonnor-
mality in the Linear Regression Model.” Journal of the American Statistical
Association 75:16-28.

White, K. J. 1972. “Estimation of the Liquidity Trap with a Generalized Functional
Form.” Econometrica 40:193-199.

Whittle, P. 1983. Prediction and Regulation. Minneapolis: University of Minnesota 
Press. 

Wiggins, S. N. 1981. “Product Quality Regulation and New Drug Introductions: Some
New Evidence from the 1970's.” Review of Economics and Statistics 63:615-619.

Willis, R. J., and S. Rosen. 1979. “Education and Self-Selection.” Journal of Political
Economy 87:S7-S36.




Witte, A. D. 1980. “Estimating the Economic Model of Crime with Individual Data.” 
Quarterly Journal of Economics 94:57-84. 

Wu, C. F. J. 1983. “On the Convergence Properties of the EM Algorithm.” Annals of
Statistics 11:95-103.

Wu, D. M. 1965. “An Empirical Analysis of Household Durable Goods Expenditure.”
Econometrica 33:761-780.

Zacks, S. 1971. Theory of Statistical Inference. New York: John Wiley & Sons. 

Zarembka, P. 1968. “Functional Form in the Demand for Money.” Journal of the
American Statistical Association 63:502-511.

Zellner, A. 1962. “An Efficient Method of Estimating Seemingly Unrelated Regres-
sions and Tests for Aggregation Bias.” Journal of the American Statistical Associ-
ation 57:348-368.

Zellner, A. 1971. An Introduction to Bayesian Inference in Econometrics. New York: John
Wiley & Sons.

Zellner, A., D. S. Huang, and L. C. Chau. 1965. “Further Analysis of the Short-Run
Consumption Function with Emphasis on the Role of Liquid Assets.” Economet-
rica 33:571-581.

Zellner, A., and H. Theil. 1962. “Three-Stage Least Squares: Simultaneous Estimation 
of Simultaneous Equations.” Econometrica 30:54-78. 


Name index 


Abraham, B., 319 

Abramovitz, M., 193 

Adams, J. D., 365 

Adelman, I. G., 420 

Aigner, D. J., 67, 69, 123 

Aitchison, J., 307 

Akahira, M., 136 

Akaike, H., 52, 146, 147, 466 

Albert, A., 271 

Albright, R. L., 309, 310 

Almon, S., 178, 469 

Amemiya, T., 40, 52, 123, 146, 147, 161, 
178, 179, 189, 195, 199, 202, 203, 204, 
205, 206, 211, 215, 217, 218, 222, 245, 
246, 248, 249, 251, 254, 257, 258, 259, 
264, 269, 278, 285, 286, 295, 296, 298, 
299, 300, 302, 303, 304, 305, 306, 321, 
322, 328, 408, 465, 468, 470, 471, 472 

Anderson, J. A., 271 

Anderson, T. W., 10, 159, 172, 173, 174, 
180, 183, 187, 216, 221, 235, 238, 239, 
240, 417, 469, 470, 471 

Andrews, D. F., 72, 73, 75 

Anscombe, F. J., 203 

Apostol, T. M., 470 

Arabmazer, A., 380, 381 

Arrow, K. J., 128 

Ashenfelter, O., 365 

Ashford, J. R., 317, 319 


Baldwin, K. F., 64, 69 
Balestra, P., 211, 213, 214, 215 
Baltagi, B. H., 211 

Baranchik, A. J., 63 

Barankin, E. W., 125, 433, 470 
Bartholomew, D. J., 412, 418 
Basmann, R. L., 239, 470 
Bassett, G., 77, 78, 154, 468 
Battese, G. E., 211, 212 

Beach, C. M., 176, 190 
Beattie, B. R., 67
Bellman, R., 416, 459, 474
Belsley, D. A., 264 

Bennett, J., 307 

Benus, J., 252 

Bera, A. K., 382 

Berger, J. O., 65 

Berkson, J., 275, 278, 280 
Berndt, E. R., 138, 145 
Berzeg, K., 214 
Bhattacharya, P. J., 65 
Bhattacharya, R. K., 92 
Bianchi, C., 262, 263 
Bickel, P. J., 71, 72, 73, 75, 154, 203, 465, 466 
Bishop, Y. M. M., 316 
Blattberg, R., 77 

Blumen, I., 418, 472 

Bock, M. E., 69, 466 
Bodkin, R. G., 141, 468 
Boes, D. C., 466 

Borjas, G. J., 405, 406 
Boskin, M. J., 383, 421, 424 
Box, G. E. P., 172, 249, 250, 465 
Breusch, T. S., 145, 206-207, 469 
Brillinger, D. R., 159 
Brook, R. J., 54 

Brown, B. W., 256, 263 
Brown, M., 128 

Brown, R. L., 222 

Brown, W. G., 67 
Brownstone, D., 304 
Brundy, J. M., 242 

Buse, A., 206 

Butler, J. S., 350 


Calzolari, G., 262, 263 
Carroll, R. J., 78, 154, 202 
Carvalho, J. L., 159, 178 
Chamberlain, G., 199, 217 
Champernowne, D. G., 420
Charatsis, E. G., 141 




Chau, L. C., 246 

Chenery, H. B., 128 

Chow, G. C., 234, 264, 465 

Christ, C. F., 228 

Christensen, L. R., 128 

Chung, K. L., 90, 466, 472 

Clark, C., 309 

Cochrane, D., 189-190 

Cooley, T. F., 222 

Cooper, J. M., 469 

Cosslett, S. R., 321, 322, 326, 331, 332, 334, 
336, 337, 338, 339, 346, 358, 471, 472 

Cox, D. R., 147, 249, 250, 251, 278, 307, 
449, 450, 465, 466 

Cragg, J. G., 410, 472 

Cramér, H., 93, 118 

Crowder, J. M., 178 

Cullen, D., 395 


Daganzo, C. F., 309 

Dagenais, M. G., 264 

Dahm, P. A., 293 

David, J. M., 293 

Davidson, W. C., 138 

Davis, L., 279 

Deacon, R., 293, 294, 471 

Dempster, A. P., 66, 375 

Dent, W. T., 469 

Dhrymes, P. J., 178 

Diewert, W. E., 128 

Doksum, K. A., 71, 465, 466 

Domencich, T. A., 269 

Doob, J. L., 159 

Draper, N. R., 137, 251, 466 

Dubin, J. A., 408 

Dudley, L., 387 

Duncan, D. B., 275 

Duncan, G. M., 405, 406, 407, 408, 473 

Duncan, G. T., 421 

Durbin, J., 191, 192, 193, 194, 196, 222, 
224, 225, 233 

Durden, G. C., 312 

Dwivedi, T. D., 198 


Efron, B., 62, 63, 64, 67, 135, 283, 284, 455 
Ehrlich, I., 252 

Eicker, F., 199 

Eisenpress, H., 264 

Engle, R. F., 142 

Enholm, G., 394, 395 

Evans, J. M., 222 


Fair, R. C., 78, 264, 265, 365, 403, 473 

Farebrother, N. W., 53, 469 

Feinberg, S. E., 316 

Feller, W., 274 

Ferguson, T. S., 125 

Fisher, F. M., 229, 256, 469, 470 

Fisher, R. A., 16 

Fletcher, R., 138 

Flinn, C. J., 446, 448, 457 

Fomby, T. B., 59 

Forsythe, A. B., 77 

Freund, J. F., 466 

Froehlich, B. R., 469 

Fuller, W. A., 159, 161, 178, 195, 202, 211, 
212, 238, 239 


Gallant, A. R., 113, 136, 137, 141, 145, 257, 
262 

Gastwirth, J. L., 74 

Gaver, K. M., 49 

Geisel, M. S., 49 

Ghosh, J. K., 125, 279 

Gillen, P. W., 389 

Gnedenko, B. V., 90 

Godfrey, L. G., 469 

Goldberg, S., 166 

Goldberger, A. S., 25, 360, 367, 380, 381, 472 

Goldfeld, S. M., 37, 38, 137, 138, 204, 206, 
207, 256, 266, 354, 470 

Goodman, L. A., 315, 316, 417, 418, 419 

Gourieroux, C., 273, 472 

Gradshteyn, I. S., 474 

Granger, C. W. J., 159 

Graybill, F. A., 466 

Greene, W. H., 368, 472 

Greenstadt, J., 264 

Grenander, U., 161 

Grether, D. M., 159, 178 

Griliches, Z., 178, 217 

Groeneveld, L. P., 444 

Gronau, R., 388, 473 

Guilkey, D. K., 317 

Gumbel, E. J., 319 

Gunst, R. F., 66 

Gurland, J., 124, 293, 433, 470 


Haberman, S. J., 316 
Haessel, W., 258 
Hahn, G. J., 378 
Hall, B. H., 138 
Hall, R. E., 138
Halmos, D. R., 467

Ham, J., 365 

Hammerstrom, T., 203 

Hampel, F. R., 72, 73, 75 

Han, A. K., 455 

Hannan, M. T., 444 

Hartley, H. O., 140, 141 

Hartley, M. J., 375, 403, 472 

Harvey, A. C., 159, 172, 189, 222, 242 

Hatanaka, M., 178, 265 

Hause, J. C., 217 

Hausman, J. A., 138, 145, 217, 218, 233, 
299, 302, 304, 308, 309, 310, 381, 395 

Heckman, J. J., 252, 348, 349, 350, 351, 353, 
354, 368, 369, 387, 389, 390, 391, 392, 
395, 396, 397, 401, 402, 412, 446, 448, 
457, 473, 474 

Heien, D., 128 

Hettmansperger, T. P., 79 

Hildreth, C., 204, 469 

Hill, R. C., 59 

Hill, R. W., 75, 77 

Hinkley, D. V., 222, 251, 465, 466 

Hiromatsu, T., 53-54 

Hoadley, B., 117 

Hodges, J. L., 61, 74, 79, 124 

Hoel, P. G., 466 

Hoerl, A. E., 56, 60, 64 

Hogg, R. J., 72, 74 

Holford, T. R., 437 

Holland, P. W., 75, 77, 316 

Holly, A., 145, 262 

Hood, W. C., 470 

Horowitz, J. L., 309 

Hosek, J. R., 328 

Houck, J. P., 204, 469 

Houthakker, H. S., 198, 203 

Howe, H., 128 

Hsiao, C., 216, 218, 220, 221, 469, 470 

Hsieh, D., 321, 328 

Huang, D. S., 246, 388 

Huber, P. J., 71, 72, 73, 75, 105, 468 

Huckfeldt, R. R., 300 

Hurd, M., 379 

Hussain, A., 210 


Imhoff, J. P., 173, 192


Jaeckel, L. A., 78, 79
Jaffee, D. M., 403, 473
James, W., 56, 61-62




Jarque, C. M., 382 

Jenkins, G. M., 172 

Jennrich, R. I., 107, 117, 468

Jobson, J. D., 202 

Johnson, C. L., 300 

Johnson, N. L., 62, 296, 300, 463, 472 
Johnson, S. R., 59 

Johnston, J., 238 

Joreskog, K. G., 217 

Jorgenson, D. W., 128, 242, 257, 258, 262 
Judge, G. G., 67, 69, 466 


Kahn, L. M., 310, 311 

Kakwani, N. C., 198 

Kalbfleisch, J. D., 412, 451, 455, 474 
Kariya, T., 198, 202 

Kay, R., 455 

Keeley, M. C., 365 

Kelejian, H. H., 218, 219, 220, 246, 249 
Kendall, M. G., 36 

Kennard, R. W., 56, 60, 64 

Kenny, L. W., 396, 397, 404 

Kiefer, J., 331, 346, 347 

Klein, L. R., 141, 468 

Kmenta, J., 186, 252 

Koenker, R., 72, 77, 78, 154, 207, 468 
Kogan, M., 418, 472 

Kolmogorov, A. N., 90 

Koopmans, T. C., 470 

Kotlikoff, L. J., 364, 365 

Kotz, S., 62, 296, 300, 463, 472 
Koyck, L. M., 178 


Lachenbruch, P. A., 285 

Laffont, J., 257, 472 

Lai, T. L., 467 

Laird, N. M., 375 

Lancaster, T., 444, 445, 446, 448, 474 

Lau, L. J., 128, 258, 262 

LeCam, L., 124, 337 

Lee, L. F., 293, 318, 319, 382, 387, 392, 394, 
396, 397, 400, 402, 403, 404, 405 

Legg, W. E., 293 

Lehmann, E. L., 61, 74, 79 

Lerman, S. R., 309, 310, 321, 322, 327, 471

Li, M. M., 280 

Lillard, L. A., 216 

Lin, L. G., 421 

Lindley, D. V., 465 

Loeve, M., 466 




McCall, J. J., 418, 419 

McCarthy, P. J., 418, 472 

MacDonald, G. M., 465 

McDonald, J. F., 365 

McFadden, D., 269, 278, 285, 286, 295, 296, 
298, 299, 300, 302, 303, 304, 305, 306, 
321, 322, 328, 408, 471, 472 

McGillivray, R. G., 283 

McKean, J. W., 79 

MacKinnon, J. G., 176, 190 

MacRae, E., 431 

MaCurdy, T. E., 217, 218, 257, 473 

Maddala, G. S., 213, 214, 216, 361, 392, 
394, 395, 396, 397, 402, 404 

Malik, H. J., 319 

Malinvaud, E., 225, 228 

Mallows, C. L., 52 

Mandelbrot, B., 71 

Mann, H. B., 87, 88, 89 

Manski, C. F., 309, 310, 321, 322, 327, 328, 
339, 343, 346, 471, 472 

Marcus, M., 462 

Mariano, R. S., 238, 263 

Marquardt, D. W., 140 

Marsh, L. C., 300 

Mason, R. L., 66 

Mayer, T., 51 

Mehta, J. S., 218, 221 

Miller, L. S., 356 

Miller, R. G., 412 

Minc, H., 462 

Minhas, B. S., 128 

Mizon, G. E., 128, 141, 145 

Moffitt, R., 350 

Monfort, A., 273, 472 

Montmarquette, C., 387 

Mood, A. M., 466 

Morimune, K., 179, 310, 311, 318, 319 

Morris, C. L., 62, 63, 64, 66, 67, 68, 466 

Mosteller, F., 74 

Muthén, B., 318 


Nagar, A. L., 242 

Nakamura, A., 395 

Nakamura, M., 395 

Nelder, J. A., 203 

Nelson, F. D., 381, 382, 389, 396, 397, 398, 
402 

Nerlove, M., 159, 178, 211, 213, 214, 215,
216, 317
Newbold, P., 159
Neyman, J., 120, 155 
Nickell, S., 447 
Nold, F. C., 421, 424 
Norden, R. H., 118 


Oberhofer, W., 186 
Olsen, R. J., 373, 387, 396, 397, 398 
Orcutt, G. H., 189-190 


Paarsch, H. J., 383 

Pagan, A. R., 145, 206-207, 469
Parke, W. R., 264, 265 

Payne, C., 67 

Pearson, E. S., 465 

Pesaran, M. H., 148 

Pfanzagl, J., 135, 468 

Phillips, P. C. B., 173, 238, 260, 266 
Plackett, R. L., 192, 319, 463, 464 
Please, N. W., 465

Poirier, D. J., 123, 251 

Polachek, S., 252 

Powell, J. L., 251, 252, 284, 382, 383 
Powell, M. J. D., 138, 141 

Powers, J. A., 300 

Prais, S. J., 198, 203 

Pratt, J. W., 293 

Prentice, R. L., 412, 451, 474 
Prescott, E. C., 222 

Press, S. J., 284, 317 


Quandt, R. E., 37, 38, 120, 137, 138, 141, 
204, 206, 207, 222, 256, 266, 354, 403, 470 


Radhakrishnan, R., 67, 68, 466

Radner, R., 356 

Ramsey, J. B., 120 

Rao, C. R., 58, 87, 89, 90, 91, 92, 93, 94, 
116, 118, 124, 138, 142, 183, 337, 407, 
463, 464, 465, 466, 469 

Reece, W. S., 365 

Reid, F., 278, 286 

Revo, L. T., 285 

Rice, P., 470 

Robbins, H., 467 

Robins, P. K., 365 

Robinson, P. M., 380 

Roberts, R. B., 394, 395 

Robertson, J. L., 280 

Rogers, W. H., 72, 73, 75
Rosen, S., 404, 405
Rosenberg, B., 222 
Rosenzweig, M. R., 365 
Royden, H. L., 19, 117 
Rubin, D. B., 375 

Rubin, H., 235 

Ruppert, D., 78, 154, 202 
Ryzhik, I. M., 474 


Sahay, S. N., 242 

Sant, D. T., 242 

Sargent, T. J., 77, 141 
Savin, N. E., 145, 280 
Sawa, T., 53-54, 240, 470 
Schatzoff, M., 66 

Scheffe, H., 30 
Schlossmacher, E. J., 78 
Schmee, J., 378 

Schmidt, P., 37, 317, 319, 380, 381 
Schwartz, G., 79, 466 
Sclove, S. L., 60, 64, 67, 68, 466 
Scott, E. L., 120, 155 
Segun, I. A., 193 

Shapiro, H., 252 

Shapiro, P., 293, 294, 471 
Shiller, R. J., 178 
Shorrocks, A. F., 418, 420 
Sickles, R., 37 

Silberman, J. I., 293, 312 
Silvey, S. D., 142, 466 
Singer, B., 412, 474 
Sinha, B. K., 279 

Small, K. A., 304, 306 
Smith, H., 137 

Smith, K. C., 280 

Smith, V. K., 470 
Sneeringer, C., 285 
Solow, R. M., 128 
Sorbom, D., 217 

Sowden, R. R., 317 
Sparmann, J. M., 309 
Spiegelman, R. G., 365 
Spitzer, J. J., 252 
Srivastava, V. K., 198 
Stapleton, D. C., 472 
Stein, C., 56, 61-62, 470 
Stephan, S. W., 218, 219, 220 
Stephenson, S. P., 365 
Stigler, S. M., 71, 75 
Strawderman, W. E., 65 




Strickland, A. D., 470 

Strotz, R. H., 229 

Stuart, A., 36 

Subramanyam, K., 125, 279 
Swamy, P. A. V. B., 218, 221, 242 
Szego, G., 161 


Talley, W. K., 293 

Taylor, W. E., 189, 201, 213, 217, 218, 238, 
239 

Taylor, W. F., 125, 470 

Telser, L. G., 431 

Theil, H., 6, 24, 25, 49, 236, 239, 241 

Thisted, R. A., 64, 66, 466 

Tiao, G. C., 465 

Tobin, J., 360, 361, 363, 364, 374, 472 

Toikka, R. S., 429, 430, 473 

Tomes, N., 396, 398 

Toyoda, T., 37 

Trost, R. P., 392, 394, 396, 397, 404 

Trotter, H. F., 138 

Tsiatis, A. A., 450 

Tsurumi, H., 470 

Tukey, J. W., 72, 73, 75 

Tuma, N. B., 444, 474 


Uhler, R. S., 300 


Van Nostrand, R. C., 466 
Vinod, H. D., 67 


Wald, A., 28, 87, 88, 89, 118, 142 
Wales, T. J., 128, 373, 375, 390, 472, 473 
Walker, S. H., 275 

Wallace, T. D., 210 

Wallis, K. F., 178 

Warner, S. L., 282, 283 

Watson, G. S., 191, 192, 193, 194, 225 
Wedderburn, R. W. M., 203 

Wei, C. Z., 467 

Weiss, Y., 216, 470 

Welch, B. L., 36 

Wermuth, N., 66 

West, R. W., 365 

Westin, R. B., 285, 286, 389 

White, H., 117, 148, 199, 200, 370, 465 
White, K. J., 252 

Whittle, P., 159, 167, 171, 177 
Wiggins, S. N., 365 




Willis, R. J., 216, 348, 349, 350, 351, 353,
404, 405
Wilson, S., 284
Wise, D. A., 308, 309, 310, 395
Witte, A. D., 365
Wolfowitz, J., 331, 346, 347
Woodland, A. D., 373, 375, 390, 472, 473
Wu, C. F. J., 375
Wu, D., 388


Young, D. J., 472


Zacks, S., 46, 138, 465
Zarembka, P., 251, 252
Zellner, A., 24, 49, 197, 241, 246, 465


Subject Index 


Absorbing state, 421 

Admissible, 47, 48 

Akaike Information Criterion (AIC): as 
solution to general model selection 
problem, 146-147; as solution to problem 
of selecting regressors, 52; in choosing 
optimal significance level, 54-55; in 
qualitative response model, 280-281, 313 

Almon lag model, 178-179 

Almost sure convergence, 86 

Almost surely uniform convergence, 106 

α-trimmed mean, 71, 73, 78

Amemiya's least squares and generalized 
least squares, 393 

Asymptotic bias, 95 

Asymptotic distribution, 92 

Asymptotic efficiency, 123-125 

Asymptotic expectation: definition, 94; 
relationship with limit of expectation and 
probability limit, 93-95 

Asymptotic F-test, 37, 38 

Asymptotic likelihood ratio test, 37-38 

Asymptotic mean. See Asymptotic expecta- 
tion 

Asymptotically normal, 92 

Asymptotically unbiased, 95 

Autocovariance: definition, 160; derivation 
in first-order autoregressive (AR(1)) 
model, 163; derivation in second-order 
autoregressive (AR(2)) model, 165-166; 
spectral density as Fourier transform of, 
160. See also Autocovariance matrix 

Autocovariance matrix: characteristic roots, 
161, 169; general form, 160; in first-order 
autoregressive (AR(1)) model, 163-164; in 
second-order autoregressive (AR(2)) 
model, 166-167; in pth order autoregres- 
sive (AR(p)) model, 167; relationship 
between that of moving-average model 
and that of autoregressive model, 171-172 


Autoregressive integrated moving-average 
(ARIMA) process, 172 
Autoregressive model 
First-order (AR(1)): asymptotic 
normality of least squares estimator 
in, 174-175; autocovariance matrix of, 
163-164; consistency of least squares 
estimator in, 173-174; definition, 162; 
maximum likelihood estimators of 
parameters, 175-176; optimal 
predictor in, 177; spectral density of, 
164 
pth order (AR(p)): autocovariance 
matrix of, 167; definition, 167; least 
squares estimators of parameters, 
172-173; moving-average representa-
tion of, 167-168; properties of, 167-170
Second-order (AR(2)): autocovariance 
matrix of, 165-167; definition, 164 
Autoregressive model with moving-average 
residuals (ARMA(p,q)): definition, 170; 
optimal predictor in, 177; spectral density 
of, 170 


Balestra-Nerlove model, 215-216 
Bayes estimator: generalized, 48; in classical 
linear regression model, 24-26; proper, 47, 
48 
Bayes's rule, 24, 48, 321 
Bayesian statistics, 23-24, 47, 48-49, 456 
Behrens-Fisher problem, 36 
Berkson's minimum chi-square (MIN χ²)
estimator
In binary qualitative response model: 
asymptotic distribution of, 276-277; 
chi-square tests based on, 280-281; 
comparison with maximum likelihood 
estimator, 278-279; definition, 276; 
inconsistency of, 278 
In first-order Markov model, 415 


In multinomial qualitative response 
model: asymptotic distribution of, 
291; definition, 290-291; in the 
Deacon-Shapiro model, 294-295 
In two-state Markov model with 
exogenous variables, 423-424 
Best asymptotically normal (BAN) estima- 
tors, 124-125 
Best linear predictor, 3 
Best linear unbiased estimator (BLUE): 
constrained least squares as, 21, 23; least 
squares as, 4, 11-13 
Best nonlinear three-stage least squares
(BNL3S) estimator: comparison with
nonlinear full information (NLFI) maximum
likelihood estimator, 259, 261; computa-
tion of, 258; definition, 257-258
Best nonlinear two-stage least squares 
(BNL2S) estimator: asymptotic inferiority 
to NLLI maximum likelihood estimator, 
252, 254-255; computation of, 249; 
definition, 248 
Best predictor, 3, 39 
Best unbiased estimator, 13, 17-20 
Beta-logistic model, 350-351 
Better (Best) estimator, 8-11, 40 
Bootstrap method, 135 
Borel field, 83 
Borel measurable, 84, 467
Borel sets, 83 
Box-Cox maximum likelihood estimator, 
250-252 
Box-Cox transformation: applications, 141, 
252; definition, 249; nonlinear two-stage 
least squares (NL2S) estimator in, 250


Categorical models. See Qualitative response 
models 

Cauchy distribution, 70, 94, 383 

Cauchy-Schwartz inequality: applications, 
98, 130; generalized, 326 

Censored regression models, 360, 364. See 
also Tobit models 

Censoring in duration models: left-, 447-448; 
right-, 434 

Central limit theorems (CLT): for K-depen-
dent sequence, 175; Liapounov's, 92;
Lindeberg-Feller's, 92; Lindeberg-Levy's,
91 
Champernowne process, 420 
Characteristic function, 91, 149 
Characteristic root, 459 
Characteristic vector, 459 
Chebyshev's inequality, 86-87 
Chi-square distribution, 463 
Choice-based sampling, 320, 325-327, 
331-332, 338 
Choice-based sampling maximum 
likelihood estimator (CBMLE): rela- 
tionship with random sampling 
maximum likelihood estimator 
(RSMLE), 321-322; case where the 
density f of the independent variables 
is known and the selection probability 
Q is known, 329; case where f is 
known and Q is unknown, 330; case 
where f is unknown and Q is known,
337-338; case where f is unknown and
Q is unknown, 333-337 
Manski-Lerman weighted maximum 
likelihood estimator (WMLE): 
asymptotic normality of, 324-325; 
consistency of, 323-324; definition, 
322-323; modification of, 327-328 
Manski-McFadden estimator (MME): 
alternative interpretation of, 336; 
asymptotic normality of, 331; 
consistency of, 330-331; definition, 330 
Clark's approximation, 309 
Classical linear regression model: assumption 
of normality, 13, 27; definition, 2; with 
linear constraints, 20-26; with stochastic 
constraints, 23-26 
Cochrane-Orcutt transformation, 189-190 
Concentrated likelihood function, 125-127 
Conditional likelihood function: definition, 
474; in Nerlove-Press model, 317; in 
nested logit model, 303 
Conditional mean squared prediction error, 
11, 39, 40 
Conjugate gradient method, 141 
Consistency, 95 
Constrained least squares (CLS) estimator: as 
best linear unbiased estimator (BLUE), 23; 
definition, 21-22; in Almon lag model, 
179; relationship with Bayes estimator 
under stochastic constraints, 25-26 


Convergence, modes of: almost sure, 86; 
almost surely uniform, 106; in distribution, 
85; in mean square, 85; in probability, 85; 
in probability in the generalized sense, 
340; in probability uniformly, 106; in 
probability semi-uniformly, 106; logical 
relationship among, 86; theorems 
concerning, 86-89 

Covariance estimator. See Transformation 
estimator 

Covariance matrix of error terms, 185, 186 

Cox's partial maximum likelihood estimator 
(PMLE): as maximization of the joint 
density of order statistics, 451-452; 
asymptotic distribution of, 450, 453-455; 
definition, 449; intuitive motivation of, 
450-451 

Cox's test: definition, 147-148; in multivar- 
iate qualitative response model, 319 

Cramér-Rao lower bound: applicability 
asymptotically, 123-124; for regression 
coefficients, 17-19; for residual variance, 
18, 19-20; general theorem on, 14-17; 
generalization of, 336-337 

Cumulants, 91 


Decision theory. See Statistical decision theory 

Dependent variables. See Endogenous 
variables 

Determinant as a criterion for ranking 
estimators, 10-11 

DFP iteration: definition, 138; in nonlinear 
simultaneous equations model, 264 

Diffuse natural conjugate, 49 

Discrete models. See Qualitative response 
models 

Discriminant analysis: in binary qualitative 
response model, 281, 282-285; in 
multinomial qualitative response model,
299-300 

Disequilibrium models, 361, 402-403 

Distributed lag models: Almon, 178-179; 
definition, 177-178; geometric, 178 

Disturbance. See Error term 

Dummy variable regression. See Transfor- 
mation estimator 

Duration, 428, 434-435 

Duration dependence, 445-446. See also 
State dependence 

Duration models

Nonstationary (semi-Markov): applica-
tions, 444-447; definition, 442;
heterogeneity in, 445-446; left-censor- 
ing in, 447-448; proportional hazards 
form of, 449-455 

Stationary: definition, 433-434; duration 
as dependent variable of regression in, 
438-440; maximum likelihood 
estimates of parameters of, 437-438; 
relationship to standard Tobit model, 
435; relationship with Poisson 
distribution, 436; right-censoring in, 
434; when observations are discrete, 
440-442 

Durbin's test: definition, 196-197; equiva- 
lence with Rao's score test, 469 

Durbin-Watson test: asymptotic distribution 
of test statistic, 191; definition, 191; exact 
distribution of test statistic, 191-193; lower 
and upper bounds for, 193-194; when 
lagged endogenous variables are present, 
195-196 


Edgeworth expansion, 93, 135, 173 
Efficient. See Better (Best) estimator 
EM algorithm: definition and convergence 
properties, 375-376; in Tobit model, 
376-378 
Empirical Bayes method, 60-61 
Empirical distribution function, 118, 135 
Endogenous sampling. See Choice-based 
sampling 
Endogenous variables, 2 
Enriched samples, 332 
Equilibrium probabilities: definition, 
415-416; in Boskin-Nold model, 424-425 
Error components model 
Three-error components model (3ECM): 
asymptotic properties of LS and GLS 
in, 209; definition, 208; definition and 
properties of transformation estimator 
in, 209-211; estimation of variances 
in, 211 
Two-error components model (2ECM): 
as between-group and within-group 
regression equations, 212-213; 
definition, 211-212; definition of GLS 
estimator in, 213; definition of 
transformation estimator in, 212; 
maximum likelihood estimation of,
213-214; with endogenous regressors, 
217-218; with serially correlated error, 
216-217 

See also Balestra-Nerlove model 

Error rate, 284 

Error term: additive versus multiplicative, 
128, 468; assumptions on density function 
of, 152; constant variance (homoscedas-
ticity), 2, 127; normality, 2-3, 13, 129;
serial correlation, 128; serial independence, 
2-3, 13 

Estimable, 58 

Exogenous sampling, 320 

Exogenous sampling maximum likelihood
estimator (ESMLE): definition, 321-322;
inconsistency in choice-based sampling 
model, 324 

Exogenous variables, 2 

Exponential family density, 124-125, 203, 279 

Extremum estimators: asymptotic normality
of, 105, 111-112, 114; consistency of,
106-111, 114; definition, 105 


F-distribution, 464 
F-test, 28-31, 32-35, 38, 136-137, 184 
Feasible generalized least squares (FGLS) 
estimator, 186 
In linear regression model with 
heteroscedasticity: as a BAN estima- 
tor, in case of general parametric 
heteroscedasticity, 202-203; definition, 
in case of constant variance in subset 
of sample, 200; definition, in case of 
unrestricted heteroscedasticity, 199; 
definition, when variance is a linear 
function of regressors, 206; exact 
distribution, in case of constant var- 
iance in subset of sample, 201-202 
In linear regression model with serial 
correlation: asymptotic deviation from 
GLS in case of lagged endogenous 
variables, 194-195; asymptotic 
equivalence with GLS, 189; 
Cochrane-Orcutt transformation as 
procedure for calculating, 189-190; ef- 
ficiency relative to least squares, 189 
In Markov model when only aggregate 
data are available, 431 
In seemingly unrelated regression (SUR) 
model, 198
In Toikka's three-state Markov model,
429-430 
In three-error components model, 211 
In two-error components model, 212, 213 
Iterated, 186 
See also Generalized least squares (GLS) 
estimator 
Fixed effects estimator. See Transformation 
estimator 
Fourier transform, 160 
Full information maximum likelihood 
(FIML) estimator: asymptotic equivalence 
with 3SLS estimator, 242; asymptotic 
normality, 233-234; consistency, 231-233; 
definition, 231-232; instrumental variables 
interpretation of, 233-234 


Gauss-Markov theorem, 11. See also Best 
linear unbiased estimator 

Gauss-Newton method: definition, 139, 140; 
for obtaining BNL3S estimator, 261; for 
obtaining NL2S estimator, 470; Hartley's 
algorithm, 140-141; Marquardt's algo- 
rithm, 140; second round estimator, 
139-140, 141 

Generalized Bayes estimator, 48 

Generalized classical linear estimator. See 
Two-stage least squares (2SLS) estimator 

Generalized extreme value (GEV) distribu- 
tion, 306 

Generalized extreme value (GEV) model, 
306-307 

Generalized least squares (GLS) estimator: as 
best linear unbiased estimator (BLUE), 
182; definition, in case of known, positive 
definite covariance matrix, 181-182; 
definition, in case of singular covariance 
matrix, 185; examples of equivalence with 
least squares, 182-184; in linear regression 
model with unknown covariance matrix, 
197, 199; in three-error components 
model, 210-211; in two-error components 
model, 212-213. See also Feasible 
generalized least squares (FGLS) estimator; 
Partially generalized least squares (PGLS) 
estimator 

Generalized maximum likelihood estimator, 
339, 346-348 

Generalized ridge estimators, 61-66 

Generalized two-stage least squares (G2SLS) 
estimator, 240-241 


Generalized Wald test: definition, 145; in 
nonlinear simultaneous equations model, 
261-262 

Geometric lag model, 178 

Goldfeld-Quandt estimator, 204-205, 206 

Goldfeld-Quandt peak test for homoscedas- 
ticity, 207 

Goodness of fit. See R²

Gumbel’s type B bivariate extreme value 
distribution, 300 


Hadamard product, 462 

Hausman's specification test: as test for 
independence of irrelevant alternatives 
(IIA), 299, 302; as test for normality in 
Tobit model, 381; asymptotic properties 
of, 145-146; in nonlinear simultaneous 
equations model, 265 

Hazard rate: in duration model, 435; in 
Tobit model, 472 

Heckman's two-step estimator: in standard 
Tobit model, 367-372; in Type 2 Tobit 
model, 386-387; in Type 3 Tobit model, 
390, 392-393; in Type 4 Tobit model, 396; 
in Type 5 Tobit model, 402 

Heterogeneity: in duration model, 445, 
446-447; in Markov model, 414; in panel 
data qualitative response model, 349, 
350-353. See also Mover-stayer model 

Heteroscedasticity: constant variance in a 
subset of the sample, 200-202; definition, 
198; general parametric, 202-203; in 
standard Tobit model, 378, 379-380; tests 
for, 200, 201, 203, 206-207; unrestricted, 
198-200; variance as an exponential 
function of the regressors, 207; variance as 
a linear function of regressors, 204-207 

Hildreth-Houck estimator in a heteroscedas- 
tic regression model: definition, 204, 205, 
206; modifications, 205-206, 469 

Hölder's inequality, 19

Homoscedasticity, 2 


Idempotent, 460 

Identification: in a linear simultaneous 
equations model, 230; in a nonlinear si- 
multaneous equations model, 256 

Incidental parameters. See Nuisance 
parameters 

Independence of irrelevant alternatives (IIA):
as characteristic of multinomial logit 
model, 298-299; nested multinomial logit
model as correction for, 300, 302; tests for, 
299, 302 

Independent variables. See Exogenous 
variables 

Information matrix, 16 

Initial conditions: in Balestra-Nerlove model, 
215, 216; in first-order autoregressive 
(AR(1)) model, 163; in Markov chain 
model, 413-414; in panel data qualitative 
response model, 352, 353-354; in 
two-error components model with serially 
correlated error, 216, 217 

Instrumental variables (IV) estimator: 
definition, 11-12; FIML estimator as, 
223-224; G2SLS estimator as, 241; in two- 
error components model with endogenous 
regressors, 217-218; two-stage least squares 
as asymptotically best, 239-240 

Iterative methods. See EM algorithm; 
Gauss-Newton method; Method of scoring; 
Newton-Raphson method 


Jackknife estimator, 135-136 
Jackknife method, 135-136 
Jensen's inequality, 116 
Jordan canonical form, 459 


Khinchine's weak law of large numbers 
(WLLN), 102 

Kolmogorov laws of large numbers (LLN), 90 

Koyck lag model, 178 

Kronecker product, 462 


L estimators, 73-74, 77-78 

L_p estimators, 72, 73, 77

Lag operator, 162 

Lagrange multiplier test. See Score test 

Lagrangian interpolation polynomial, 469 

Laplace distribution, 70 

Laws of large numbers (LLN): Khinchine's, 
102; Kolmogorov's number 1 and number 
2, 90; Markov's, 467; strong, 90; weak, 90 

Least absolute deviations (LAD) estimator: 
in classical regression model, 152-154; in 
standard Tobit model, 382-383. See also 
Median 

Least squares (LS) estimator 

In autoregressive model: asymptotic 
normality in first-order case, 174-175; 
asymptotic normality in pth order 
case, 173; consistency in first-order 
case, 173-174; small sample proper- 
ties, 173 

In classical linear regression model: as 
best linear unbiased estimator 
(BLUE), 11-13; as best unbiased 
estimator under normality of error 
terms, 17-20; asymptotic normality of, 
96-98, 99; consistency of, 95-96; 
definition, 4-5; equivalence under 
normality of, with maximum 
likelihood estimator (MLE), 13; 
geometric interpretation of, 5-6; mean 
and variance of, 7-8; of a linear 
combination of regression parameters, 
7, 58; of a subset of regression 
parameters, 6-7; unbiased alternative 
to estimate of variance of error terms, 8 

In duration model with one completed 
spell per individual, 439 

In geometric lag model, 178 

In linear regression model with general 
covariance matrix of error terms: 
asymptotic normality in case of 
known covariance matrix, 185; 
asymptotic normality in case of serial 
correlation, 187-188; consistency, 
184-185; covariance matrix, 182; ex- 
amples of equivalence with GLS, 
183-184; inconsistency in presence of 
lagged endogenous variables and serial 
correlation, 194; relative efficiency as 
compared with GLS, 182-183 

In Markov model when only aggregate 
data are available, 431 

In standard Tobit model: biasedness 
when all observations used, 367-368; 
biasedness when positive observations 
used, 367 

In Toikka's three-state Markov model, 
429, 430 

See also Constrained least squares (CLS) 
estimator; Generalized least squares 
(GLS) estimator. 

Least squares predictor, 39-40 

Least squares residuals, 5, 21, 32-33 

Lebesgue convergence theorem, 117 


Lebesgue measure, 83, 106, 113, 124, 467 
Lebesgue-Stieltjes integral, 467 
Liapounov central limit theorem (CLT), 
90-91, 92 
Likelihood function. See Concentrated 
likelihood function; Conditional likelihood 
function; Maximum likelihood estimator 
(MLE) 
Likelihood ratio test 
As test for homoscedasticity, 201 
In nonlinear regression model, 144, 145 
Under general parametric hypotheses: 
asymptotic distribution of, 142-144; 
definition, 142; small sample 
properties of, 145 
Under linear hypotheses on a linear 
model: definition, 28-32, 32-34, 38; 
relationship with Wald test and score 
test, 144-145 
Limit distribution, 85, 92 
Limited dependent variables model. See 
Tobit model 
Limited information maximum likelihood 
(LIML) estimator: asymptotic distribution, 
237-238; asymptotic equivalence with 
2SLS estimator, 236, 238; definition, 
235-236; exact distribution, 238-239; 
Fuller's modification, 238 
Limited information model, 234-235 
Linear regression model, with general covar- 
iance matrix of error terms, 184 
Linear simultaneous equations model, 
228-229 
Lindeberg-Feller central limit theorem 
(CLT), 90-91, 92 
Lindeberg-Levy central limit theorem (CLT), 
90-92 
Linear constraints: as testable hypotheses, 
27; form of, 20; stochastic, 23-24 
Logistic distribution: Gumbel's bivariate, 
319; Plackett's bivariate, 319; relationship 
with normal distribution, 269 
Logit model 
Binary: definition, 268-269; global 
concavity of likelihood function, 273 
Multinomial: as result of utility-maxi- 
mizing behavior, 296-297; definition, 
295; global concavity of likelihood 
function, 295-296; independence of 
irrelevant alternatives (IIA) in, 298
Multivariate: definition, 314; fundamen- 
tal difference between multivariate 
probit model and, 318 
Multivariate nested, 313-314 
Nested: definition, 300-302, 303; 
estimation, 303-304; multiple-level 
forms, 305-306 
Ordered, 292-293 
Sequential, 310 
Universal, 307 
Logit transformation, 278 
Log-linear model: definition, 314, 315; 
estimation of, 317; saturated, 315; 
unsaturated, 315 
Log Weibull distribution, 296 


M estimators, 72-73, 75-77, 105. See also 
Extremum estimators 

Mallow's criterion: definition, 52; optimal 
critical value of F-test implied by, 55 

Markov chain model: definition, 412-413; 
duration in, 428; estimation of, when only 
aggregate data are available, 430-433; 
homogeneous, 414; hypotheses tests in, 
417; maximum likelihood estimates of pa- 
rameters, 417, 422; minimum chi-square 
(MIN χ²) estimates of parameters of, 415,
423-424; mover-stayer model, 418; 
multi-state (Toikka's model), 428-430; 
nonlinear generalized least squares 
estimates of parameters, 414-415; second- 
order model, 420; stationary, 414; 
two-state (Boskin-Nold model), 421-428 

Markov matrix, 413 

Markov's law of large numbers (LLN), 467 

Maximum likelihood estimator (MLE): 
asymptotic efficiency, 123-124; asymptotic 
normality, 120-121; consistency, 115-116, 
118; definition, 115; in autoregressive
moving average (ARMA(p,q))
model, 172; in Balestra-Nerlove model,
216; in binary qualitative response model, 
271-273, 274-275, 278-279; in Boskin- 
Nold model, 425-427; in classical linear re- 
gression model, 13-14, 17-19, 21, 37-38, 
118-119, 121-123; in duration model, 
437-439; in first-order autoregressive 
(AR(1)) model, 175-176; in homogeneous 
and stationary first-order Markov model,
417; in Hsiao's random coefficients model, 
220-221; in linear regression model with 
general parametric heteroscedasticity, 
202-203; in linear regression model with 
serial correlation, 190-191; in Markov 
model when only aggregate data are 
available, 433; in multinomial qualitative 
response model, 287, 288-289; in 
two-error components model, 213-214; in 
two-state Markov model with exogenous 
variables, 422, 424; inconsistency in 
certain models, 120; iterative methods for 
obtaining, 137-139; second-order 
efficiency, 124-125, 279; under constraints 
on the parameters, 142. See also Box-Cox 
maximum likelihood estimator; Choice- 
based sampling; Cox's partial maximum 
likelihood estimator (PMLE); Exogenous 
sampling maximum likelihood estimator; 
Full information maximum likelihood 
(FIML) estimator; Generalized maximum 
likelihood estimator in qualitative 
response model; Limited information 
maximum likelihood (LIML) estimator; 
Nonlinear full information (NLFI) 
maximum likelihood estimator; Nonlinear 
limited information (NLLI) maximum 
likelihood estimator; Probit maximum 
likelihood estimator; Random-effect probit 
maximum likelihood estimator; Random 
sampling maximum likelihood estimator 
(RSMLE); Tobit maximum likelihood 
estimator 
Maximum score estimator: consistency, 
340-343, 344-345; definition, 339, 
343-344; relationship with generalized 
maximum likelihood estimator, 347-348 
Mean squared error, 8, 40 
Measurable function, 467 
Measure, 467 
Measure space, 467 
Median 
Population, 148 
Sample: asymptotic normality, 149-151; 
asymptotic variance under normal, 
Laplace, and Cauchy distributions, 
70-71; consistency, 150; definition, 70, 
73, 149. See also Least absolute 
deviations (LAD) estimator 




Method of scoring: conditions for equiva- 
lence with Gauss-Newton method, 203; 
definition, 138; equivalence with nonlinear 
weighted least squares (NLWLS) iteration, 
274-275, 289-290; in random coefficients 
model, 221; when variance is linear 
function of regressors, 206 

Mills’ ratio, 472 

Minimax estimator, 47, 48 

Minimax regret, 47 

Minimini principle, 51 

Minimum chi-square (MIN χ²) method, 275.
See also Berkson's minimum chi-square
(MIN χ²) estimator

MINQUE, 469 

Mixed estimator under stochastic constraints, 
25 

Mixture of normal distributions, 72, 77, 
119-120 

Model selection problem, 146 

Modified nonlinear two-stage least squares 
(MNL2S) estimator: definition, 254;
inconsistency when normality fails to hold, 
255 

Moore-Penrose generalized inverse, 112, 466 

Mover-stayer model, 418-419 

Moving-average (MA) models: autocovar- 
iance matrix of, 171; first-order (MA(1)), 
171-172; spectral density of, 170-171 

Multicollinearity, 56, 59 


Newton-Raphson method: applications, 141; 
as method of calculating FIML estimator, 
234; definition, 137; DFP iteration, 138; 
for obtaining NLFI maximum likelihood 
estimator, 264; method of scoring as 
means of finding MLE, 138; quadratic 
hill-climbing, 138; second-round estima- 
tor, 137, 139 

Nonlinear full information (NLFI) maxi- 
mum likelihood estimator: asymptotic 
properties, 259, 261; definition, 259; 
inconsistency when error terms are not 
normal, 259-260; iterative method, 260-261 

Nonlinear generalized least squares (NLGLS) 
estimator: in first-order Markov model, 
415; in Markov model when only 
aggregate data are available, 431. See also 
Nonlinear weighted least squares 
(NLWLS) estimator 


Nonlinear least squares (NLLS) estimator: 
asymptotic normality of, 132-134; 
consistency of, 129-130; definition, 
127-129; equivalence of, with MLE under 
normality, 129; in Markov model when 
only aggregate data are available, 431, 433; 
in standard Tobit model, 372-373, 378; 
inconsistency in nonlinear simultaneous 
equations model, 245; under general 
parametric constraints, 144; under linear 
constraints, 136-137 

Nonlinear limited information (NLLI) 
maximum likelihood estimator: asympto- 
tic covariance matrix of, 254; definition, 
252-254; inconsistency of, when normality 
fails to hold, 255; iterative methods for 
solving for, 254 

Nonlinear regression model, 127-128 

Nonlinear simultaneous equations models: 
full information case, 255-256; limited 
information case, 252-253 

Nonlinear three-stage least squares (NL3S) 
estimator: asymptotic normality of, 257; 
consistency of, 257; definition (Amemiya), 
257; definition (Jorgenson and Laffont), 
257. See also Best nonlinear three-stage 
least squares (BNL3S) estimator 

Nonlinear two-stage least squares (NL2S) 
estimator: asymptotic normality, 247-248; 
consistency, 246-247; definition, 246, 250; 
in case of Box-Cox transformation, 250, 
251-252. See also Best nonlinear two-stage 
least squares (BNL2S) estimator; Modified 
nonlinear two-stage least squares (MNL2S) 
estimator 

Nonlinear weighted least squares (NLWLS) 
estimator: in binary qualitative response 
model, 274-275; in duration model with 
one completed spell per individual, 440; in 
Markov model when only aggregate data 
are available, 432, 433; in multinomial 
qualitative response model, 289-290; in 
multi-state Markov model with exogenous 
variables, 428-429; in standard Tobit 
model, 372-373, 472; in two-state Markov 
model with exogenous variables, 423. See 
also Method of scoring 

Nonnested models, 147-148 

Nonnormality: in standard Tobit model, 
378, 380-382, 383; robust estimation 
under, 70-71 


Normal discriminant analysis. See Discrimi- 
nant analysis 
Nuisance parameters, 120 


Odds ratio: in Bayesian solution to selection- 
of-regressors problem, 48-49; in binary 
logit model, 278 

One-factor model in panel data qualitative 
response model: definition, 352-353; 
random-effect probit maximum likelihood 
estimator in, 354; transformation 
estimator in, 353-354 

Optimal significance level, 52-55 

Order condition of identifiability, 230-231 

Order relationship, 90-91 

Overidentification, 231 


Panel data qualitative response model: 
definition, 348; heterogeneity, 349-353; 
one-factor, 352-354; specification of initial 
conditions, 352, 353-354; true state 
dependence, 349, 351-353 

Parke's derivative free algorithm, 264 

Partially generalized least squares (PGLS) 
estimator, 199-200 

Posterior distribution, 24 

Posterior probability, 48 

Posterior risk, 47 

Prediction 

In classical linear regression model: best 
linear predictor, 3; best predictor, 3, 
39; conditional mean squared 
prediction error, 39, 40; least squares 
as best linear unbiased predictor, 
39-40; unconditional mean squared 
prediction error, 40 

In nonlinear simultaneous equations 
model, 262-264 

In time series model, 177 

Prediction criterion: definition, 51; optimal 
critical value of F-test implied by, 54, 55 

Pre-test estimator, 67 

Principal components, 58 

Principal components estimator, 58-60 

Probability measure, 82 

Probability of correct classification (PCC), 284 

Probability space, 82 

Probit maximum likelihood estimator: in 
binary qualitative response model, 271, 
273-274; in multinomial qualitative 
response model, 307-310; in multivariate
qualitative response model, 317-318; in 
Tobit model, 366, 386. See also Probit 
model 
Probit model 
Binary: definition, 268, 269; global 
concavity of likelihood function, 
273-274 
Multinomial: definition, 307-308; 
independent versus nonindependent, 
308; iterative methods for calculating 
estimators in, 300 
Multivariate: definition, 317, 318; 
fundamental difference between mul- 
tinomial logit model and, 318 
Ordered, 292-293 
Sequential, 310 
Probit transformation, 277 
Projection matrix, 14 
Proportional hazards model, 449-455 


Quadratic hill-climbing, 138 
Qualitative response models, 267 
Binary: definition, 268; hypotheses tests, 
280-282; prediction of aggregate 
proportion, 285-286 
Multinomial: definition, 287; hypotheses 
tests, 291-292; ordered, 292-293; 
unordered, 292, 293 
Multivariate, 311 
Quantal response models. See Qualitative 
response models 


R estimators, 74, 78-79 

R²: definition, 6; test for equality with zero,
31; Theil's correction, 49-51, 54, 55; 
weakness, 46 

Random coefficients models (RCM's): 
Hildreth-Houck and error components 
models as special cases of, 204, 218; 
Hsiao's model, 220-221; Kelejian and 
Stephan model, 218-220; Swamy-Mehta 
model, 221-222; Swamy's model, 221; 
varying parameter regression model, 222 

Random-effect probit maximum likelihood 
estimator, 354 

Random sampling maximum likelihood 
estimator (RSMLE), 321-322, 329 

Random variables, 81, 83 

Rank condition of identifiability, 230 
Rao's score test. See Score test

Reduced form equations, 229 

Regret, 47 

Reverse least squares estimator, 42, 99-101 

Ridge regression: generalized ridge estima- 
tors, 63-66; relationship with empirical 
Bayes method, 60-61; relationship with 
Stein-type estimators, 56; ridge estimators, 
60-66; ridge trace method, 60 

Riemann integral, 84

Riemann-Stieltjes integral, 84-85 

Risk, 47 

Robust estimation: concept of, 70-72; in 
regression case, 75-79; L estimator, 73-74, 
77-78; L_p estimator, 73, 77; M estimator,
72-73, 75-77, 105; median, 70-71, 73,
148-151; R estimator, 74, 78-79; trimmed 
mean, 73, 78; Winsorized mean, 73. See 
also Least absolute deviations (LAD) 
estimator 


Sample space, 81-82 

Schwartz's criterion, 79 

Score function, 339 

Score test: as test for homoscedasticity, 
206-207; asymptotic distribution in general 
case, 142, 144; asymptotic distribution in 
nonlinear simultaneous equations model, 
145; definition, 142; equivalence with 
Durbin's test, 469; in linear model with 
linear constraint, 144-145; in nonlinear re- 
gression model with normality, 144; 
relationship with Lagrange multiplier test, 
142 

Second-order efficiency of maximum 
likelihood estimator, 125. See also Best 
asymptotically normal (BAN) estimator 

Seemingly unrelated regression (SUR) 
model, 197-198 

Selection of regressors: Bayesian solution, 
48-49; selection among nested models, 45, 
52-55; selection among nonnested models, 
46-52 

Separate families of hypotheses, 147 

Serial correlation of error terms: as first-order 
autoregressive (AR(1)) process, 188-195; 
definition, 186; in linear regression model 
with lagged endogenous variables, 
194-197; properties of least squares (LS) 
estimator in case of, 184-185, 187-188 


σ-algebra, 82

Spectral density: conditions for existence and 
continuity, 169; definition, 160-161; of 
autoregressive moving average (ARMA 
(p,q)) process, 170; relationship between 
that of moving-average model and that of 
autoregressive model, 170-171; relation- 
ship with characteristic roots of autocovar- 
iance matrix, 161 

Standard linear regression model. See 
Classical linear regression model 

State dependence, 349, 351. See also 
Duration dependence 

Stationary time series: as infinite sum of 
cycles with random coefficients, 161; defi- 
nition (strictly), 159; definition (weakly), 
159 

Statistical decision theory, 46-48 

Stein's estimator, 61-62 

Stein's modified estimator, 63 

Stein's positive rule estimator, 63 

Stieltjes integral, 84-85, 467 

Stochastic order relationship. See Order 
relationship 

Stratified sampling, 471 

Strictly stationary time series, 159 

Strong convergence, 467. See also Almost 
sure convergence 

Strong laws of large numbers (LLN), 90 

Strongly consistent, 95 

Structural change, tests for, 31-35, 36-38 

Structural equations, 229 

Student's t distribution, 463

Superefficient estimators, 124 

Survival models. See Duration models 

Switching regression model, 120, 411 


t-test, 27-28, 34-35, 36-37, 136, 184, 465 

Test for equality of variances, 35. See also 
Heteroscedasticity 

Tests of general parametric hypotheses: 
generalized Wald statistic, 145; Hausman's 
specification test, 146-147; likelihood ratio 
test, 142-145; score test, 142, 144-145; 
Wald test, 142-145 

Tests of linear hypotheses: F-test (likelihood 
ratio test), 28-31, 32-34, 184, 465; 
generalization of F-test to nonlinear 
model, 136-137; generalization of t-test to 
nonlinear model, 136; t-test, 27-28, 34-35,
36-37, 184, 465; tests for structural
change, 31-35, 36-38 

Theil’s corrected R²: appraisal of, 49-51, 55;
optimal critical value of F-test implied by, 
54 

Three-stage least squares (3SLS) estimator, 
241-242. See also Nonlinear three-stage
least squares (NL3S) estimator 

Tobit maximum likelihood estimator: as an 
equilibrium solution of the EM algorithm, 
376-378; asymptotic properties, 373; 
consistency under serial correlation, 380; 
definition, 373; global concavity of 
likelihood function, 373-374; inconsistency 
under heteroscedasticity, 379-380; 
inconsistency under nonnormality, 380-381 

Tobit model: Amemiya's consistent 
estimator, 374-375; applications, 364, 365, 
387-389, 391-392, 394-395, 397-399, 
400-408; as result of utility maximizing 
behavior, 362; classifications, 383-384; 
consistency of estimators under serial cor- 
relation, 378-379, 380; definition, 360; 
disequilibrium models, 402-403; general 
simultaneous equations forms, 385, 
392-393, 396-397; global concavity of 
likelihood function, 373-374; inconsistency 
of estimators, 378-381; tests for nonnor- 
mality, 381-382; Tobin's initial estimator, 
373. See also Heckman's two-step 
estimator; Least absolute deviations (LAD) 
estimator; Least squares (LS) estimator; 
Nonlinear least squares (NLLS) estimator; 
Nonlinear weighted least squares 
(NLWLS) estimator; Probit maximum 
likelihood estimator; Tobit maximum like- 
lihood estimator 

Toeplitz form, 160 

Trace, as a criterion for ranking estimators, 
10, 11 

Transformation estimator: in Balestra- 
Nerlove model, 215-216; in discrete case, 
353-354; in Hsiao's random coefficients 
model, 220; in three-error components 
model, 209-211; in two-error components 
model, 212-213 

Transition probabilities, 413 

Truncated regression model, 360, 364. See 
also Tobit models 
Two-stage least squares (2SLS) estimator:
asymptotic distribution, 236-237; 
asymptotic equivalence with LIML 
estimator, 238; definition, 236; exact 
distribution, 238-239; interpretations,
239-240. See also Generalized two-stage 
least squares (G2SLS) estimator; Nonlinear 
two-stage least squares (NL2S) estimator

Two-step method of estimation: in multivar- 
iate nested logit model, 303-304; in nested 
logit model, 314. See also Heckman’s 
two-step estimator 

Type I extreme value distribution, 296 


Unconditional mean squared prediction 
error, 40 

Underidentification, 231 

Uniform convergence of a sequence of 
random variables: almost surely uniform, 
106; in probability semi-uniformly, 106; in 
probability uniformly, 106 

Uniformly smaller risk, 47 


Wald’s test: asymptotic distribution in 
nonlinear simultaneous equations model, 
145; asymptotic distribution, 142, 144; 
definition, 142; in linear model with linear 
constraints, 144-145; in nonlinear 
regression model with normality, 144 

Weak convergence, 467. See also Consist- 
ency; Convergence in probability 

Weak laws of large numbers (LLN), 90 

Weakly consistent, 95 

Weakly stationary time series, 159 

Weibull distribution, 445 

Weighted least squares (WLS) estimator, 
371-372. See also Feasible generalized least 
squares (FGLS) estimator; Generalized 
least squares (GLS) estimator; Nonlinear
weighted least squares (NLWLS) estimator 

Welch’s method, 36-37 

Winsorized mean, 73 


